2.9. Understanding cluster rebuilding
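The storage cluster is self-healing. If a node or disk fails, the cluster automatically tries to restore the lost data, that is, to rebuild itself.
The rebuild process involves several steps. Every CS sends a heartbeat message to an MDS every 5 seconds. If a heartbeat is not sent, the CS is considered inactive, and the MDS informs all cluster components so that they stop requesting operations on its data. If no heartbeats are received from a CS for 15 minutes, the MDS considers the CS offline and starts cluster rebuilding (if the prerequisites below are met). In the process, the MDS finds CSes that do not hold pieces (replicas) of the lost data and restores the data, one piece (replica) at a time, as follows:
If replication is used, the existing replicas of a degraded chunk are locked (to make sure all replicas remain identical) and one of them is copied to the new CS. If at this time a client needs to read some data that has not yet been rebuilt, it reads any remaining replica of that data.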
If erasure coding is used, the new CS requests almost all of the remaining data pieces to rebuild the missing ones. If at this time a client needs to read some data that has not yet been rebuilt, that data is rebuilt out of turn and then read.
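The heartbeat timing described above can be sketched as a small state check. This is only an illustration, not the MDS implementation: the function name `classify_cs` and the state labels are hypothetical, and the rule that a single missed 5-second interval marks a CS as inactive is an assumption drawn from the wording above.

```python
HEARTBEAT_INTERVAL_S = 5        # each CS reports to the MDS every 5 seconds
OFFLINE_TIMEOUT_S = 15 * 60     # 15 minutes without heartbeats -> offline


def classify_cs(seconds_since_last_heartbeat: float) -> str:
    """Return a hypothetical state label for a CS based on heartbeat age."""
    if seconds_since_last_heartbeat >= OFFLINE_TIMEOUT_S:
        # The MDS may start rebuilding now, if the prerequisites are met.
        return "offline"
    if seconds_since_last_heartbeat > HEARTBEAT_INTERVAL_S:
        # Assumption: one missed interval is enough to mark the CS inactive,
        # so other components stop requesting operations on its data.
        return "inactive"
    return "active"


if __name__ == "__main__":
    for elapsed in (3, 60, 900):
        print(elapsed, classify_cs(elapsed))
```

A CS that resumes sending heartbeats before the 15-minute mark would go back to "active" in this model without triggering a rebuild.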
Note
If a node or disk goes offline during maintenance, cluster self-healing is delayed to save cluster resources. The default delay is 30 minutes. You can adjust it by setting the mds.wd.offline_tout_mnt parameter, in milliseconds, with the vstorage -c <cluster_name> set-config command.
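Because the parameter is given in milliseconds, it can help to convert the delay from minutes before setting it. The helper below is a hypothetical convenience function, not part of the vstorage tooling. It only computes the value and does not assume any particular argument syntax for set-config.

```python
def minutes_to_ms(minutes: float) -> int:
    """Convert a delay in minutes to milliseconds."""
    return round(minutes * 60 * 1000)


if __name__ == "__main__":
    # The default maintenance delay of 30 minutes, expressed in milliseconds.
    print(minutes_to_ms(30))
    # A longer, one-hour delay for an extended maintenance window.
    print(minutes_to_ms(60))
```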
Self-healing requires more network traffic and CPU resources if replication is used. On the other hand, rebuilding with erasure coding is slower.
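For a cluster to be able to rebuild itself, it must have at least:
As many healthy nodes as the redundancy mode requires
Enough free space to accommodate as much data as any one node can store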
The first prerequisite can be explained by the following example. In a cluster that works in the 5+2 erasure coding mode and has seven nodes (the minimum), each piece of user data is distributed to 5+2 nodes for redundancy, so every node is used. If one or two nodes fail, the user data will not be lost, but the cluster will become degraded and will not be able to rebuild itself until at least seven nodes are healthy again (that is, until you add the missing nodes). For comparison, in a cluster that works in the 5+2 erasure coding mode and has ten nodes, each piece of user data is distributed to a random set of 5+2 nodes out of ten, to even out the load on CSes. If up to three nodes fail, such a cluster will still have enough nodes to rebuild itself.
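The node-count prerequisite from this example can be written as a simple check. The sketch below is illustrative only, with hypothetical function names. It tests just the number of healthy nodes against the M+N mode, and does not model free space or whether the data itself survived the failures.

```python
def nodes_required(data_pieces: int, parity_pieces: int) -> int:
    """Minimum number of healthy nodes for an M+N erasure coding mode."""
    return data_pieces + parity_pieces


def can_rebuild(total_nodes: int, failed_nodes: int,
                data_pieces: int, parity_pieces: int) -> bool:
    """True if enough healthy nodes remain to place all M+N pieces."""
    healthy = total_nodes - failed_nodes
    return healthy >= nodes_required(data_pieces, parity_pieces)


if __name__ == "__main__":
    # Seven-node cluster in 5+2 mode: one failure leaves six nodes, too few.
    print(can_rebuild(7, 1, 5, 2))
    # Ten-node cluster in 5+2 mode: three failures still leave seven nodes.
    print(can_rebuild(10, 3, 5, 2))
```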
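The second prerequisite can be explained by the following example. In a cluster that has ten 10 TB nodes, at least 1 TB on each node should be kept free, so that if a node fails, its 9 TB of data can be rebuilt on the remaining nine nodes. If, however, a cluster has ten 10 TB nodes and one 20 TB node, each smaller node should have at least 2 TB free in case the largest node fails (while the largest node should have 1 TB free).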
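The arithmetic behind these figures can be reproduced with a short calculation. This is a sketch under two assumptions that the text does not state explicitly: the failed node's data is spread evenly across the surviving nodes, and, for the mixed-size cluster, the failed node is conservatively treated as full. The function name is hypothetical.

```python
def free_space_needed_per_node(failed_node_data_tb: float,
                               surviving_nodes: int) -> float:
    """Free space (TB) each surviving node needs to absorb a failed node's data.

    Assumes the data is redistributed evenly across the survivors.
    """
    if surviving_nodes <= 0:
        raise ValueError("at least one surviving node is required")
    return failed_node_data_tb / surviving_nodes


if __name__ == "__main__":
    # Ten 10 TB nodes, each holding 9 TB: one fails, nine survive.
    print(free_space_needed_per_node(9, 9))
    # Ten 10 TB nodes plus one 20 TB node: the 20 TB node fails (assumed full).
    print(free_space_needed_per_node(20, 10))
    # Same cluster: a 10 TB node fails (assumed full), ten nodes survive.
    print(free_space_needed_per_node(10, 10))
```

Under these assumptions the three cases give 1 TB, 2 TB, and 1 TB per surviving node, matching the figures in the example.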
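Two recommendations help reduce the rebuilding overhead:
To simplify rebuilding, keep the number of disks and their capacities uniform across all nodes.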
Rebuilding places additional load on the network, and increases the latency of read and write operations. The more network bandwidth the cluster has, the faster rebuilding will be completed and bandwidth freed up.