Recap for 12/11/2022 Khala Node Incident

Hi Phamily!

A critical bug hit our network nodes on Saturday and caused some PRB workers to stop working properly. In this post, we record and recap the incident, including the background, the solution, and the progress so far.

Below is the official announcement we made to the community and partners that day, which describes the situation at the time.

Incident Announcement to the community

We have encountered a critical bug with the Kusama node that caused database corruption at a certain height, but the node itself does not report a critical error.

# Technical background

Around 2022-11-12 11:28 UTC, the Kusama relay chain hit a pre-commit equivocation problem at #15294837. A large number of Khala nodes received an incorrect justification that PRuntime refused to accept, and as a result PRuntime computing got stuck.

PRB compacts blocks into a data pack, and PRuntime validates the last justification in it. Coincidentally, the last justification was the one for the bad block.

The problem relates to https://github.com/paritytech/substrate/issues/11175 . There is a fix included in Polkadot-v0.9.26, and we are keeping in touch with Parity about the situation.

# Patterns of the issue

1. The node keeps crashing, and the log contains:

[Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Error: Service(Other("Essential task failed."))

## Solution

Your khala-node may be too old. Try upgrading to the latest version.

2. All workers in PRB report an error with the message:

{"payload":"{\"message\":\"bad justification for header: invalid commit in grandpa justification\"}","signature":"027f36a74f948c6a87c9d0c13a60a9aaab9fd35763b1d02622c68f976d10803e66640a5268e7e09de18630c7ca734866b0b7810c0a37c9db800edb567d7e6a8b","status":"error"}

## Solution

We prepared a reference script that helps PRuntime skip the problem block. After running the script, restart the PRB lifecycle manager service.

wget https://raw.githubusercontent.com/Phala-Network/phala-blockchain/master/scripts/fix_pruntime_justification_error.sh
chmod +x fix_pruntime_justification_error.sh

Edit the file to set the correct ENV values before running it:

sudo ./fix_pruntime_justification_error.sh
Restart PRB
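
For operators who want to spot pattern 2 across many workers, the error response shown above can be checked programmatically. The following is a minimal Python sketch, assuming the response arrives as a JSON string shaped like the sample in this post. The helper name `is_justification_error` and the placeholder signature are hypothetical and not part of PRB or PRuntime.

```python
import json

# Substring taken from the sample error message in this post.
BAD_JUSTIFICATION_MARKER = "bad justification for header"


def is_justification_error(raw_response: str) -> bool:
    """Return True if a response looks like the pattern 2 error.

    Hypothetical helper: assumes the outer JSON object has "status" and
    "payload" fields, and that "payload" is itself a JSON-encoded string.
    """
    try:
        outer = json.loads(raw_response)
        if not isinstance(outer, dict) or outer.get("status") != "error":
            return False
        inner = json.loads(outer.get("payload", "{}"))
    except (json.JSONDecodeError, TypeError):
        # Not JSON, or the payload is not a JSON string: treat as "no match".
        return False
    if not isinstance(inner, dict):
        return False
    return BAD_JUSTIFICATION_MARKER in str(inner.get("message", ""))


# Sample built to mirror the response above; the signature is a placeholder.
sample = json.dumps({
    "payload": json.dumps({
        "message": "bad justification for header: "
                   "invalid commit in grandpa justification"
    }),
    "signature": "00",
    "status": "error",
})

print(is_justification_error(sample))
```

A worker whose response matches is a candidate for the fix script above. Other errors return `False` and need separate investigation.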

We are tracking the issue and will keep updating the script over the coming period. If the solution doesn’t work for you, you are welcome to post your issue in the :pick:|miner channel.

==========================

Impact of the incident

  • During the 10800 blocks after the faulty block, 9915 workers entered the “unresponsive” state
  • Up to 7469 workers were in the “unresponsive” state at the same time
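
To put these figures in perspective, the short Python sketch below converts the block window into wall-clock time and computes the peak share of affected workers. It assumes a target block time of 12 seconds, which is not stated in this post, so treat the resulting duration as an estimate.

```python
# Figures from the impact summary above.
BLOCKS_IN_WINDOW = 10800
WORKERS_AFFECTED = 9915
PEAK_UNRESPONSIVE = 7469

# Assumption: a target block time of 12 seconds (not stated in this post).
ASSUMED_BLOCK_TIME_SECONDS = 12

# Length of the observation window in hours.
window_hours = BLOCKS_IN_WINDOW * ASSUMED_BLOCK_TIME_SECONDS / 3600

# Share of all affected workers that were unresponsive at the peak.
peak_share = PEAK_UNRESPONSIVE / WORKERS_AFFECTED

print(f"Observation window: about {window_hours:.0f} hours")
print(f"Peak share of affected workers unresponsive at once: {peak_share:.1%}")
```

Under the 12-second assumption, the window is about 36 hours, and roughly three quarters of the affected workers were unresponsive at the peak.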

==========================

Incident Timeline Recap

  • Occurrence time: 2022-11-12 11:28:36 (UTC)
    Khala block - #2,702,763
    Kusama block - #15,294,834

  • First report of this issue from the community: 2022-11-12 12:55

  • Incident handling timeline by the project team

    1. 2022-11-12 11:34 Incident discovered by the dev team.
    2. 2022-11-12 11:51 Incident confirmed by the dev team.
    3. 2022-11-12 11:59 Alarm escalated by the dev team; all developers and key team members of Phala were required to join the incident channel and assist.
    4. 2022-11-12 13:07 Issue located and an announcement made to the computing community.
    5. 2022-11-12 14:38 The incident caused pool APRs to drop steeply and withdrawals to increase, so we put an announcement banner on the Phala App to explain the incident background to delegators.
    6. 2022-11-12 15:06 The dev team found a further solution and started deploying the code.
    7. 2022-11-12 15:52 The solution code was deployed and tested.
    8. 2022-11-12 17:24 The tests passed, and the solution was shared in an announcement to the community and partners.
  • 60% of the affected workers were fixed by 2022-11-12 20:19:19

  • 95% of the affected workers were fixed by 2022-11-13 02:04:00
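
The elapsed time between the occurrence and the key milestones can be derived from the timestamps above. The Python sketch below does this under two assumptions: every timestamp in the timeline is in UTC, and entries given to the minute are treated as `:00` seconds. The milestone labels are shorthand chosen for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
occurrence = datetime.strptime("2022-11-12 11:28:36", FMT)

# Milestones copied from the timeline; minute-precision entries get ":00".
milestones = {
    "discovered by dev team": "2022-11-12 11:34:00",
    "announcement to computing community": "2022-11-12 13:07:00",
    "solution shared": "2022-11-12 17:24:00",
    "60% of affected workers fixed": "2022-11-12 20:19:19",
    "95% of affected workers fixed": "2022-11-13 02:04:00",
}

# Elapsed time from the occurrence to each milestone.
elapsed = {
    label: datetime.strptime(stamp, FMT) - occurrence
    for label, stamp in milestones.items()
}

for label, delta in elapsed.items():
    print(f"{label}: {delta} after occurrence")
```

Read this way, the dev team noticed the problem within about five and a half minutes, shared a tested fix a little under six hours in, and 95% of affected workers were recovered about fourteen and a half hours after the faulty block.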

==========================

The Progress and incident conclusion so far

After communicating with Parity, the team identified the root cause of the problem:

  • A recent Substrate upgrade (Polkadot v0.9.26) introduced a change in its consensus protocol.
  • The upgrade introduced a new mechanism called “Grandpa equivocation”. It extends the block justification format to include additional data.
  • The Phala offchain client (pruntime) follows the blockchain through a light client, which verifies the Grandpa block justification from the blockchain. The format of the block justification has been used for years, and thus we considered it stable.
  • However, after the Substrate (Polkadot v0.9.26) upgrade, the block data may come with additional data that pruntime does not recognize. When block data in the new format arrives, pruntime rejects the block and complains “invalid grandpa justification”. From the compute resource provider's perspective, pruntime simply stops syncing the blockchain.

Therefore, we conclude that a full fix would be to upgrade pruntime to support the new block justification format. This may take one week given the complexity of deploying a pruntime upgrade.

Before the full fix, we expect the impact of the incident to be relatively low. The new block justification is still compatible with the old pruntime as long as there is no Grandpa equivocation in the block data, which is usually rare. That is also why the problem was not caught by the test process of the Phala release and only happened weeks after the Kusama v0.9.26 upgrade.

On the other hand, the pruntime syncing process doesn’t always get stuck when there is a block with equivocation. pruntime only verifies the last block in a batch, so the worker only gets stuck if the block with Grandpa equivocation is located at the end of a block syncing batch, which makes the impact of the equivocation smaller. PRB syncs blocks with a fixed batch size, while the solo script syncs blocks with a dynamic batch size. Given that equivocation is rare, the solo script is likely to bypass the affected block when the next block arrives, which further lowers the impact of the incident. (The temporary PRB mitigation script uses pherry to manually bypass the block with equivocation to work around the problem.)
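
The batch-boundary effect described above can be illustrated with a small Python sketch. It is a simplified model and not the actual PRB or pruntime logic: it assumes fixed-size batches laid end to end from a starting height, with only the last block of each batch verified. The function name `ends_batch` and all heights are hypothetical.

```python
def ends_batch(sync_start: int, batch_size: int, height: int) -> bool:
    """Return True if `height` is the last block of a fixed-size batch.

    Simplified model: batches cover sync_start..sync_start + batch_size - 1,
    then the next batch_size blocks, and so on. Only the last block of each
    batch is verified, so only that block can stall the sync.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if height < sync_start:
        return False
    return (height - sync_start + 1) % batch_size == 0


# Hypothetical numbers for illustration only.
BAD_HEIGHT = 1_000
BATCH_SIZE = 4

# Try every alignment of the batch boundary relative to the bad block.
stuck_starts = [
    start
    for start in range(BAD_HEIGHT - BATCH_SIZE + 1, BAD_HEIGHT + 1)
    if ends_batch(start, BATCH_SIZE, BAD_HEIGHT)
]

print(f"{len(stuck_starts)} of {BATCH_SIZE} alignments end a batch on the bad block")
```

In this model only one alignment out of `BATCH_SIZE` puts the bad block at the end of a batch, which matches the point above that a worker stalls only when the equivocation block happens to close a batch.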

==========================

As mentioned above, we are still working on the pruntime upgrade and will roll it out once development is complete. Meanwhile, we are continuing to review this incident and will update this post with new information in a timely manner.
