2.9. Understanding cluster rebuilding

The storage cluster is self-healing. If a node or disk fails, the cluster automatically tries to restore the lost data, that is, to rebuild itself.

The rebuild process consists of several steps. Every CS sends a heartbeat message to an MDS every 5 seconds. If a heartbeat is not received, the CS is considered inactive and the MDS tells all cluster components to stop requesting operations on its data. If no heartbeats are received from a CS for 15 minutes, the MDS considers that CS offline and starts cluster rebuilding (provided the prerequisites below are met). In the process, the MDS finds CSes that do not hold pieces (replicas) of the lost data and restores the data, one piece (replica) at a time, as follows:

  • If replication is used, the existing replicas of a degraded chunk are locked (to make sure all replicas remain identical) and one is copied to the new CS. If at this time a client needs to read some data that has not been rebuilt yet, it reads any remaining replica of that data.
  • If erasure coding is used, the new CS requests almost all of the remaining data pieces to rebuild the missing ones. If at this time a client needs to read some data that has not yet been rebuilt, that data is rebuilt out of turn and then read.
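
The network cost of the two rebuild paths above can be compared with a rough estimate. The sketch below is illustrative only and is not a product tool; the 256 MB chunk size is an assumption made for the example. Restoring a lost replica copies roughly one chunk over the network, while restoring a lost 5+2 piece requires reading about five pieces of the same total size, plus CPU time for decoding.

    # Illustrative only: approximate network reads needed to restore one lost
    # piece of data. The 256 MB chunk size is an assumption for this example.
    awk 'BEGIN {
        chunk_mb = 256
        # Replication: copy one surviving replica of the whole chunk.
        printf "replication: ~%d MB read to restore a %d MB replica\n", chunk_mb, chunk_mb
        # Erasure coding 5+2: read any 5 remaining pieces to decode 1 lost piece.
        k = 5; piece_mb = chunk_mb / k
        printf "erasure 5+2: ~%d MB read to restore a %.1f MB piece\n", k * piece_mb, piece_mb
    }'

Per lost megabyte, the erasure-coded path reads several times more data and also needs CPU for decoding, which is why its rebuild takes longer.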

Note

If a node or disk goes offline during maintenance, cluster self-healing is delayed to save cluster resources. The default delay is 30 minutes. You can adjust it by setting the mds.wd.offline_tout_mnt parameter (in milliseconds) with the vstorage -c <cluster_name> set-config command.
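
For example, to raise the delay from the default 30 minutes (1,800,000 ms) to 60 minutes, a command like the following could be used. The parameter=value argument form shown here is an assumption; check the syntax accepted by your vstorage version.

    # Assumed syntax: set the maintenance self-healing delay to 60 minutes (in ms).
    vstorage -c <cluster_name> set-config "mds.wd.offline_tout_mnt=3600000"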

Because rebuilding a missing erasure-coded piece requires reading several remaining pieces and decoding them, self-healing with erasure coding consumes more network traffic and CPU resources than with replication, and it is also slower.

For a cluster to be able to rebuild itself, it must have at least:

  1. As many healthy nodes as required by the redundancy mode
  2. Enough free space to accommodate as much data as any one node can store

The first prerequisite can be explained with the following example. In a cluster that works in the 5+2 erasure coding mode and has seven nodes (that is, the minimum), each piece of user data is distributed to 5+2 nodes for redundancy, so every node is used. If one or two nodes fail, the user data will not be lost, but the cluster will become degraded and will not be able to rebuild itself until at least seven nodes are healthy again (that is, until you add the missing nodes). For comparison, in a cluster that works in the 5+2 erasure coding mode and has ten nodes, each piece of user data is distributed to a random set of 5+2 nodes out of the ten, to even out the load on CSes. If up to three nodes fail, such a cluster will still have enough healthy nodes to rebuild itself.
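
This rule amounts to simple arithmetic, as the illustrative sketch below shows (not a product tool; the node counts match the example above): a 5+2 cluster can rebuild only while at least seven nodes, one per piece, remain healthy.

    # Illustrative only: check whether all 5+2 pieces can be placed on
    # distinct healthy nodes after some nodes fail.
    awk 'BEGIN {
        total_nodes = 10; failed_nodes = 3
        k = 5; m = 2                     # 5+2 erasure coding
        healthy = total_nodes - failed_nodes
        printf "%s\n", (healthy >= k + m ? "can rebuild" : "degraded until nodes are added")
    }'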

The second prerequisite can be explained with the following example. In a cluster that has ten 10 TB nodes, at least 1 TB on each node should be kept free, so that if a node fails, its 9 TB of data can be rebuilt on the remaining nine nodes. If, however, a cluster has ten 10 TB nodes and one 20 TB node, each smaller node should have at least 2 TB free in case the largest node fails (while the largest node should have 1 TB free).
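
For a cluster of equally sized nodes, the rule of thumb behind these numbers is that each node of capacity C in an N-node cluster should keep roughly C/N free, so that the data of one failed node fits into the combined free space of the rest. The sketch below is illustrative only and uses the ten-node example above.

    # Illustrative only: free-space reserve for a cluster of equal-size nodes.
    awk 'BEGIN {
        nodes = 10; capacity_tb = 10
        free_tb = capacity_tb / nodes    # keep at least this much free per node
        printf "keep >= %.1f TB free on each of the %d nodes\n", free_tb, nodes
        printf "a failed node holds up to %.1f TB, which fits into %d x %.1f TB of free space\n",
               capacity_tb - free_tb, nodes - 1, free_tb
    }'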

Two recommendations help reduce the overhead of rebuilding:

  • To simplify rebuilding, keep the number of disks and their capacities uniform across all nodes.
  • Rebuilding places additional load on the network and increases the latency of read and write operations. The more network bandwidth the cluster has, the sooner rebuilding completes and that bandwidth is freed up (see the rough estimate below).
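
As a rough lower bound, the time to rebuild a failed node is at least the time needed to move its data over the available network bandwidth. The sketch below is illustrative only; the 9 TB figure comes from the ten-node example above, the 10 Gbit/s link speed is an assumption, and the estimate ignores erasure-coding read amplification and competing client traffic.

    # Illustrative only: lower bound on rebuild time from data volume and bandwidth.
    awk 'BEGIN {
        data_tb = 9; link_gbps = 10
        seconds = data_tb * 8e12 / (link_gbps * 1e9)
        printf "moving %.0f TB at %.0f Gbit/s takes at least ~%.1f hours\n",
               data_tb, link_gbps, seconds / 3600
    }'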