.. |_| unicode:: 0xA0
   :trim:

.. _Planning Node Hardware Configurations:

Planning Node Hardware Configurations
-------------------------------------

|product_name| works on top of commodity hardware, so you can create a cluster
from regular servers, disks, and network cards. Still, to achieve optimal
performance, a number of requirements must be met and a number of
recommendations should be followed.

.. only:: ac

   .. note:: If you are unsure of what hardware to choose, consult your sales
      representative. You can also use the `online hardware calculator `__.

.. only:: vz

   .. note:: If you are unsure of what hardware to choose, consult your sales
      representative. You can also use the `online hardware calculator `__.

.. _Hardware Limits:

Hardware Limits
~~~~~~~~~~~~~~~

The following table lists the current hardware limits for |product_name|
servers:

.. tabularcolumns:: |>{\TL}\X{2}{15}%
                    |>{\TL}\X{3}{15}%
                    |>{\TL}\X{3}{15}|

.. _Server hw limits:

.. table:: Server hardware limits
   :class: longtable

   ======== ================= ================
   Hardware Theoretical       Certified
   ======== ================= ================
   RAM      64 TB             1 TB
   CPU      5120 logical CPUs 384 logical CPUs
   ======== ================= ================

A logical CPU is a core (thread) in a multicore (multithreading) processor.

.. _Hardware Requirements:

Hardware Requirements
~~~~~~~~~~~~~~~~~~~~~

A |product_name| deployment consists of a single management node and a number
of storage and compute nodes. The following subsections list the node hardware
requirements for each usage scenario.

.. _Requirements for Management Node with Storage and Compute:

Requirements for Management Node with Storage and Compute
**********************************************************

The following table lists the minimal and recommended hardware requirements
for a management node that also runs the storage and compute services.

If you plan to enable high availability for the management node (recommended),
all the servers that you will add to the HA cluster must meet the requirements
listed in this table.

.. tabularcolumns:: |>{\TL}\X{1}{7}%
                    |>{\TL}\X{3}{7}%
                    |>{\TL}\X{3}{7}|

.. _Hw for management node with storage and compute:

.. table:: Hardware for management node with storage and compute
   :class: longtable

   +---------+--------------------------------------+--------------------------------------+
   | Type    | Minimal                              | Recommended                          |
   +=========+======================================+======================================+
   | CPU     | 64-bit x86 Intel processors with     | 64-bit x86 Intel processors with     |
   |         | "unrestricted guest" and VT-x with   | "unrestricted guest" and VT-x with   |
   |         | Extended Page Tables (EPT) enabled   | Extended Page Tables (EPT) enabled   |
   |         | in BIOS                              | in BIOS                              |
   |         |                                      |                                      |
   |         | 16 logical CPUs in total\*           | 32+ logical CPUs in total\*          |
   +---------+--------------------------------------+--------------------------------------+
   | RAM     | 32 GB\*\*                            | 64+ GB\*\*                           |
   +---------+--------------------------------------+--------------------------------------+
   | Storage | 1 disk: system + metadata, 100+ GB   | 2+ disks: system + metadata + cache, |
   |         | SATA HDD                             | 100+ GB recommended enterprise-grade |
   |         |                                      | SSDs in a RAID1 volume, with power   |
   |         | 1 disk: storage, SATA HDD, size as   | loss protection and 75 MB/s          |
   |         | required                             | sequential write performance per     |
   |         |                                      | serviced HDD (e.g., 750+ MB/s total  |
   |         |                                      | for a 10-disk node)                  |
   |         |                                      |                                      |
   |         |                                      | 4+ disks: HDD or SSD, 1 DWPD         |
   |         |                                      | endurance minimum, 10 DWPD           |
   |         |                                      | recommended                          |
   +---------+--------------------------------------+--------------------------------------+
   | Network | 1 GbE for storage traffic            | 2 x 10 GbE (bonded) for storage      |
   |         |                                      | traffic                              |
   |         | 1 GbE (VLAN tagged) for other        |                                      |
   |         | traffic                              | 2 x 1 GbE or 2 x 10 GbE (bonded,     |
   |         |                                      | VLAN tagged) for other traffic       |
   +---------+--------------------------------------+--------------------------------------+

.. include:: /includes/planning-node-hardware-configurations-part1.inc
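
The CPU requirements in the table above can be checked on an existing server
before deployment. The following is a minimal sketch, assuming a RHEL 7-based
system with the ``kvm_intel`` module loaded; exact output varies between
distributions and kernel versions::

   # lscpu | grep -E '^CPU\(s\):|^Virtualization:'
   # grep -c vmx /proc/cpuinfo
   # cat /sys/module/kvm_intel/parameters/ept
   # cat /sys/module/kvm_intel/parameters/unrestricted_guest

The first two commands report the number of logical CPUs and whether VT-x is
available; the last two print ``Y`` (or ``1`` on older kernels) if EPT and
"unrestricted guest" support are active in the loaded KVM module.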

.. _Requirements for Storage and Compute:

Requirements for Storage and Compute
************************************

The following table lists the minimal and recommended hardware requirements
for a node that runs the storage and compute services.

.. tabularcolumns:: |>{\TL}\X{1}{7}%
                    |>{\TL}\X{3}{7}%
                    |>{\TL}\X{3}{7}|

.. _Hw for storage and compute:

.. table:: Hardware for storage and compute
   :class: longtable

   +---------+--------------------------------------+--------------------------------------+
   | Type    | Minimal                              | Recommended                          |
   +=========+======================================+======================================+
   | CPU     | 64-bit x86 processor(s) with Intel   | 64-bit x86 processor(s) with Intel   |
   |         | VT hardware virtualization           | VT hardware virtualization           |
   |         | extensions enabled                   | extensions enabled                   |
   |         |                                      |                                      |
   |         | 8 logical CPUs\*                     | 32+ logical CPUs\*                   |
   +---------+--------------------------------------+--------------------------------------+
   | RAM     | 8 GB\*\*                             | 64+ GB\*\*                           |
   +---------+--------------------------------------+--------------------------------------+
   | Storage | 1 disk: system, 100 GB SATA HDD      | 2+ disks: system, 100+ GB SATA HDDs  |
   |         |                                      | in a RAID1 volume                    |
   |         | 1 disk: metadata, 100 GB SATA HDD    |                                      |
   |         | (only on the first three nodes in    | 1+ disk: metadata + cache, 100+ GB   |
   |         | the cluster)                         | enterprise-grade SSD with power loss |
   |         |                                      | protection and 75 MB/s sequential    |
   |         | 1 disk: storage, SATA HDD, size as   | write performance per serviced HDD   |
   |         | required                             | (e.g., 750+ MB/s total for a 10-disk |
   |         |                                      | node)                                |
   |         |                                      |                                      |
   |         |                                      | 4+ disks: HDD or SSD, 1 DWPD         |
   |         |                                      | endurance minimum, 10 DWPD           |
   |         |                                      | recommended                          |
   +---------+--------------------------------------+--------------------------------------+
   | Network | 1 GbE for storage traffic            | 2 x 10 GbE (bonded) for storage      |
   |         |                                      | traffic                              |
   |         | 1 GbE (VLAN tagged) for other        |                                      |
   |         | traffic                              | 2 x 1 GbE or 2 x 10 GbE (bonded)     |
   |         |                                      | for other traffic                    |
   +---------+--------------------------------------+--------------------------------------+

.. include:: /includes/planning-node-hardware-configurations-part1.inc
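
Before assigning the disk roles listed in the tables above (system, metadata,
cache, storage), it helps to inventory the drives on each node. A minimal
sketch, assuming standard Linux utilities and example device names::

   # lsblk -d -o NAME,TYPE,SIZE,ROTA,MODEL
   # smartctl -i /dev/sda

In the ``lsblk`` output, ``ROTA`` is ``1`` for rotational drives (HDDs) and
``0`` for SSDs; ``smartctl`` (from the smartmontools package) prints the exact
drive model, which you can check against the vendor's endurance and power loss
protection specifications.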

.. only:: ac

   .. _Hardware Requirements for Backup Gateway:

   Hardware Requirements for Backup Gateway
   ****************************************

   The following table lists the minimal and recommended hardware requirements
   for a management node that also runs the storage and ABGW services.

   .. tabularcolumns:: |>{\TL}\X{1}{7}%
                       |>{\TL}\X{3}{7}%
                       |>{\TL}\X{3}{7}|

   .. _Hw for Backup Gateway:

   .. table:: Hardware for Backup Gateway
      :class: longtable

      +---------+--------------------------------------+--------------------------------------+
      | Type    | Minimal                              | Recommended                          |
      +=========+======================================+======================================+
      | CPU     | 64-bit x86 processor(s) with AMD-V   | 64-bit x86 processor(s) with AMD-V   |
      |         | or Intel VT hardware virtualization  | or Intel VT hardware virtualization  |
      |         | extensions enabled                   | extensions enabled                   |
      |         |                                      |                                      |
      |         | 4 logical CPUs\*                     | 4+ logical CPUs\*, at least one CPU  |
      |         |                                      | per 8 HDDs                           |
      +---------+--------------------------------------+--------------------------------------+
      | RAM     | 4 GB\*\*                             | 16+ GB\*\*                           |
      +---------+--------------------------------------+--------------------------------------+
      | Storage | 1 disk: system + metadata, 120 GB    | 1 disk: system + metadata + cache,   |
      |         | SATA HDD                             | 120+ GB recommended enterprise-grade |
      |         |                                      | SSD with power loss protection and   |
      |         | 1 disk: storage, SATA HDD, size as   | 75 MB/s sequential write performance |
      |         | required                             | per serviced HDD                     |
      |         |                                      |                                      |
      |         |                                      | 1 disk: storage, SATA HDD, 1 DWPD    |
      |         |                                      | endurance minimum, 10 DWPD           |
      |         |                                      | recommended, size as required        |
      +---------+--------------------------------------+--------------------------------------+
      | Network | 1 GbE                                | 2 x 10 GbE (bonded)                  |
      +---------+--------------------------------------+--------------------------------------+

   .. include:: /includes/planning-node-hardware-configurations-part1.inc

.. _Hardware Recommendations:

Hardware Recommendations
~~~~~~~~~~~~~~~~~~~~~~~~

In general, |product_name| works on the same hardware that is recommended for
Red Hat Enterprise Linux 7: `servers `__, `components `__.

The following recommendations further explain the benefits added by specific
hardware in the hardware requirements table. Use them to configure your
cluster in an optimal way.

.. _Storage Cluster Composition Recommendations:

Storage Cluster Composition Recommendations
*******************************************

Designing an efficient storage cluster means finding a compromise between
performance and cost that suits your purposes. When planning, keep in mind
that a cluster with many nodes and few disks per node offers higher
performance, while a cluster with the minimal number of nodes (3) and a lot of
disks per node is cheaper. See the following table for more details.

.. tabularcolumns:: |>{\TL}\X{2}{8}%
                    |>{\TL}\X{3}{8}%
                    |>{\TL}\X{3}{8}|

.. _Cluster composition recom:

.. table:: Cluster composition recommendations
   :class: longtable

   +------------------------+---------------------------------------+-----------------------------------+
   | Design considerations  | Minimum nodes (3),                    | Many nodes, few disks per node    |
   |                        | many disks per node                   | (all-flash configuration)         |
   +========================+=======================================+===================================+
   | Optimization           | Lower cost.                           | Higher performance.               |
   +------------------------+---------------------------------------+-----------------------------------+
   | Free disk space to     | More space to reserve for             | Less space to reserve for cluster |
   | reserve                | cluster rebuilding as fewer           | rebuilding as more healthy nodes  |
   |                        | healthy nodes will have to            | will have to store the data       |
   |                        | store the data from a failed node.    | from a failed node.               |
   +------------------------+---------------------------------------+-----------------------------------+
   | Redundancy             | Fewer erasure coding choices.         | More erasure coding choices.      |
   +------------------------+---------------------------------------+-----------------------------------+
   | Cluster balance and    | Worse balance and slower rebuilding.  | Better balance and faster         |
   | rebuilding performance |                                       | rebuilding.                       |
   +------------------------+---------------------------------------+-----------------------------------+
   | Network capacity       | More network bandwidth required to    | Less network bandwidth required   |
   |                        | maintain cluster performance during   | to maintain cluster performance   |
   |                        | rebuilding.                           | during rebuilding.                |
   +------------------------+---------------------------------------+-----------------------------------+
   | Favorable data type    | Cold data (e.g., backups).            | Hot data (e.g., virtual           |
   |                        |                                       | environments).                    |
   +------------------------+---------------------------------------+-----------------------------------+
   | Sample server          | Supermicro SSG-6047R-E1R36L (Intel    | Supermicro SYS-2028TP-HC0R-SIOM   |
   | configuration          | Xeon E5-2620 v1/v2 CPU, 32GB RAM,     | (4 x Intel E5-2620 v4 CPUs,       |
   |                        | 36 x 12TB HDDs, a 500GB system disk). | 4 x 16GB RAM, 24 x 1.9TB Samsung  |
   |                        |                                       | PM1643 SSDs).                     |
   +------------------------+---------------------------------------+-----------------------------------+

Take note of the following:

#. These considerations only apply if the failure domain is host.
#. The speed of rebuilding in the replication mode does not depend on the
   number of nodes in the cluster.
#. |product_name| supports hundreds of disks per node. If you plan to use more
   than 36 disks per node, contact sales engineers who will help you design a
   more efficient cluster.

.. _General Hardware Recommendations:

General Hardware Recommendations
********************************

- At least five nodes are required for a production environment. This is to
  ensure that the cluster can survive failure of two nodes without data loss.
- One of the strongest features of |product_name| is scalability. The bigger
  the cluster, the better |product_name| performs. It is recommended to create
  production clusters from at least ten nodes for improved resilience,
  performance, and fault tolerance.
- Even though a cluster can be created on top of varied hardware, using nodes
  with similar hardware will yield better cluster performance, capacity, and
  overall balance.
- Any cluster infrastructure must be tested extensively before it is deployed
  to production. Common points of failure, such as SSD drives and network
  adapter bonds, must always be thoroughly verified.
- It is not recommended for production to run |product_name| on top of SAN/NAS
  hardware that has its own redundancy mechanisms. Doing so may negatively
  affect performance and data availability.
- To achieve best performance, keep at least 20% of cluster capacity free.
- During disaster recovery, |product_name| may need additional disk space for
  replication. Make sure to reserve at least as much space as any single
  storage node has (see the capacity example after this list).
- It is recommended to have the same CPU models on each node to avoid VM live
  migration issues. For more details, see the *Administrator's Command Line
  Guide*.
- If you plan to use Backup Gateway to store backups in the cloud, make sure
  the local storage cluster has plenty of logical space for staging (keeping
  backups locally before sending them to the cloud). For example, if you
  perform backups daily, provide enough space for at least 1.5 days' worth of
  backups. For more details, see the *Administrator's Guide*.
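
As a worked example of the two capacity reserves mentioned above (the node
count and per-node capacity are illustrative assumptions, not recommendations),
consider a ten-node cluster with 36 TB of raw disk space per node::

   Total raw capacity:                      10 nodes x 36 TB = 360 TB
   Rebuild reserve (one node's capacity):   keep at least 36 TB free
   Performance reserve (20% of capacity):   keep at least 72 TB free

In this example the 20% performance reserve is larger than the rebuild
reserve, so keeping 20% of the cluster free satisfies both recommendations.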

.. _Storage Hardware Recommendations:

Storage Hardware Recommendations
********************************

- It is possible to use disks of different sizes in the same cluster. However,
  keep in mind that, given the same IOPS, smaller disks will offer higher
  performance per terabyte of data compared to bigger disks. It is recommended
  to group disks with the same IOPS per terabyte in the same tier.
- Using the recommended SSD models may help you avoid loss of data. Not all
  SSD drives can withstand enterprise workloads and may break down in the
  first months of operation, resulting in TCO spikes.
- SSD memory cells can withstand a limited number of rewrites. An SSD drive
  should be viewed as a consumable that you will need to replace after a
  certain time. Consumer-grade SSD drives can withstand a very low number of
  rewrites (so low, in fact, that these numbers are not shown in their
  technical specifications). SSD drives intended for storage clusters must
  offer at least 1 DWPD endurance (10 DWPD is recommended). The higher the
  endurance, the less often SSDs will need to be replaced, improving TCO (see
  the endurance example after this list).
- Many consumer-grade SSD drives can ignore disk flushes and falsely report to
  operating systems that data was written while it, in fact, was not. Examples
  of such drives include OCZ Vertex 3, Intel 520, Intel X25-E, and Intel
  X-25-M G2. These drives are known to be unsafe in terms of data commits,
  they should not be used with databases, and they may easily corrupt the file
  system in case of a power failure. For these reasons, use enterprise-grade
  SSD drives that obey the flush rules (for more information, see
  http://www.postgresql.org/docs/current/static/wal-reliability.html).
  Enterprise-grade SSD drives that operate correctly usually have the power
  loss protection property in their technical specification. Some of the
  market names for this technology are Enhanced Power Loss Data Protection
  (Intel), Cache Power Protection (Samsung), Power-Failure Support (Kingston),
  and Complete Power Fail Protection (OCZ).
- It is highly recommended to check the data flushing capabilities of your
  disks as explained in :ref:`Checking Disk Data Flushing Capabilities`.
- Consumer-grade SSD drives usually have unstable performance and are not
  suited to withstand sustainable enterprise workloads. For this reason, pay
  attention to sustainable load tests when choosing SSDs. We recommend the
  following enterprise-grade SSD drives which are the best in terms of
  performance, endurance, and investments: Intel S3710, Intel P3700, Huawei
  ES3000 V2, Samsung SM1635, and Sandisk Lightning.
- Performance of SSD disks may depend on their size. Lower-capacity drives
  (100 to 400 GB) may perform much slower (sometimes up to ten times slower)
  than higher-capacity ones (1.9 to 3.8 TB). Consult drive performance and
  endurance specifications before purchasing hardware.
- Using NVMe or SAS SSDs for write caching improves random I/O performance and
  is highly recommended for all workloads with heavy random access (e.g.,
  iSCSI volumes). In turn, SATA disks are best suited for SSD-only
  configurations but not write caching.
- Using shingled magnetic recording (SMR) HDDs is strongly not recommended,
  even for backup scenarios. Such disks have unpredictable latency that may
  lead to unexpected temporary service outages and sudden performance
  degradations.
- Running metadata services on SSDs improves cluster performance. To also
  minimize CAPEX, the same SSDs can be used for write caching.
- If capacity is the main goal and you need to store infrequently accessed
  data, choose SATA disks over SAS ones. If performance is the main goal,
  choose NVMe or SAS disks over SATA ones.
- The more disks per node, the lower the CAPEX. As an example, a cluster
  created from ten nodes with two disks in each will be less expensive than a
  cluster created from twenty nodes with one disk in each.
- Using SATA HDDs with one SSD for caching is more cost effective than using
  only SAS HDDs without such an SSD.
- Create hardware or software RAID1 volumes for system disks using RAID or HBA
  controllers, respectively, to ensure their high performance and
  availability.
- Use HBA controllers as they are less expensive and easier to manage than
  RAID controllers.
- Disable all RAID controller caches for SSD drives. Modern SSDs have good
  performance that can be reduced by a RAID controller's write and read cache.
  It is recommended to disable caching for SSD drives and leave it enabled
  only for HDD drives.
- If you use RAID controllers, do not create RAID volumes from HDDs intended
  for storage. Each storage HDD needs to be recognized by |product_name| as a
  separate device.
- If you use RAID controllers with caching, equip them with backup battery
  units (BBUs) to protect against cache loss during power outages.
- Disk block size (e.g., 512B or 4K) is not important and has no effect on
  performance.
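
To relate the DWPD figures above to the endurance numbers that SSD vendors
usually publish (TBW, terabytes written), you can convert one into the other.
A worked example with made-up specification values::

   DWPD = TBW / (drive capacity x warranty period in days)

   Example: a 1.92 TB SSD rated for 3,500 TBW over a 5-year warranty
   DWPD = 3500 / (1.92 x 5 x 365) ~= 1.0

Such a drive meets the 1 DWPD minimum for storage clusters; drives that you
expect to be rewritten heavily should be closer to the recommended 10 DWPD.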

.. _Network Hardware Recommendations:

Network Hardware Recommendations
********************************

- Use separate networks (and, ideally albeit optionally, separate network
  adapters) for internal and public traffic. Doing so will prevent public
  traffic from affecting cluster I/O performance and also prevent possible
  denial-of-service attacks from the outside.
- Network latency dramatically reduces cluster performance. Use quality
  network equipment with low-latency links. Do not use consumer-grade network
  switches.
- Do not use desktop network adapters like Intel EXPI9301CTBLK or Realtek 8129
  as they are not designed for heavy load and may not support full-duplex
  links. Also use non-blocking Ethernet switches.
- To avoid intrusions, |product_name| should be on a dedicated internal
  network inaccessible from outside.
- Use one 1 Gbit/s link per each two HDDs on the node (rounded up). For one or
  two HDDs on a node, two bonded network interfaces are still recommended for
  high network availability. The reason for this recommendation is that
  1 Gbit/s Ethernet networks can deliver 110-120 MB/s of throughput, which is
  close to the sequential I/O performance of a single disk. Since several
  disks on a server can deliver higher throughput than a single 1 Gbit/s
  Ethernet link, networking may become a bottleneck.
- For maximum sequential I/O performance, use one 1 Gbit/s link per each hard
  drive, or one 10 Gbit/s link per node. Even though I/O operations are most
  often random in real-life scenarios, sequential I/O is important in backup
  scenarios.
- For maximum overall performance, use one 10 Gbit/s link per node (or two
  bonded for high network availability).
- It is not recommended to configure 1 Gbit/s network adapters to use
  non-default MTUs (e.g., 9000-byte jumbo frames). Such settings require
  additional configuration of switches and often lead to human error.
  10+ Gbit/s network adapters, on the other hand, need to be configured to use
  jumbo frames to achieve full performance (see the sketch after this list).
- The currently supported Fibre Channel host bus adapters (HBAs) are QLogic
  QLE2562-CK and QLogic ISP2532.
- It is recommended to use Mellanox ConnectX-4 and ConnectX-5 InfiniBand
  adapters. Mellanox ConnectX-2 and ConnectX-3 cards are not supported.
- Adapters using the BNX2X driver, such as Broadcom Limited BCM57840 NetXtreme
  II 10/20-Gigabit Ethernet / HPE FlexFabric 10Gb 2-port 536FLB Adapter, are
  not recommended. They limit MTU to 3616, which affects the cluster
  performance.
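
The bonding and jumbo frame recommendations above can be applied with the
standard NetworkManager tooling. The following is a minimal sketch only; the
interface names (``eth0``, ``eth1``), the bond mode, and the MTU value are
assumptions that must be adapted to your adapters and switch configuration::

   # nmcli connection add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,miimon=100"
   # nmcli connection add type bond-slave con-name bond0-port1 ifname eth0 master bond0
   # nmcli connection add type bond-slave con-name bond0-port2 ifname eth1 master bond0
   # nmcli connection modify bond0 802-3-ethernet.mtu 9000
   # nmcli connection up bond0

Keep in mind that jumbo frames must also be enabled on the corresponding
switch ports and that, as noted above, non-default MTUs are only worth
configuring on 10+ Gbit/s links.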

.. _Hardware and Software Limitations:

Hardware and Software Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hardware limitations:

- Each management node must have at least two disks (one system+metadata, one
  storage).
- Each compute or storage node must have at least three disks (one system, one
  metadata, one storage).
- Three servers are required to test all the features of the product.
- Each server must have at least 4 GB of RAM and two logical cores.
- The system disk must have at least 100 GB of space.
- The admin panel requires a Full HD monitor to be displayed correctly.
- The maximum supported physical partition size is 254 TiB.

Software limitations:

- The maintenance mode is not supported. Use SSH to shut down or reboot a
  node.
- One node can be a part of only one cluster.
- Only one S3 cluster can be created on top of a storage cluster.
- Only predefined redundancy modes are available in the admin panel.
- Thin provisioning is always enabled for all data and cannot be configured
  otherwise.
- The admin panel has been tested to work at resolutions of 1280x720 and
  higher in the following web browsers: the latest versions of Firefox,
  Chrome, and Safari.

.. slated for 2.5u1: Microsoft Edge, as well as Internet Explorer 11

For network limitations, see :ref:`Network Limitations`.

.. _Minimum Storage Configuration:

Minimum Storage Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The minimum configuration described in the table will let you evaluate the
features of the storage cluster. It is not meant for production.

.. tabularcolumns:: |>{\TL}\X{1}{10}%
                    |>{\TL}\X{2}{10}%
                    |>{\TL}\X{2}{10}%
                    |>{\TL}\X{2}{10}%
                    |>{\TL}\X{3}{10}|

.. _Min cluster configuration:

.. table:: Minimum cluster configuration
   :class: longtable

   +----------+---------------+-----------------+-----------------+-------------------------------+
   | Node #   | 1st disk role | 2nd disk role   | 3rd+ disk roles | Access points                 |
   +==========+===============+=================+=================+===============================+
   | 1        | System        | Metadata        | Storage         | iSCSI, S3 private, S3 public, |
   |          |               |                 |                 | NFS, ABGW                     |
   +----------+---------------+-----------------+-----------------+-------------------------------+
   | 2        | System        | Metadata        | Storage         | iSCSI, S3 private, S3 public, |
   |          |               |                 |                 | NFS, ABGW                     |
   +----------+---------------+-----------------+-----------------+-------------------------------+
   | 3        | System        | Metadata        | Storage         | iSCSI, S3 private, S3 public, |
   |          |               |                 |                 | NFS, ABGW                     |
   +----------+---------------+-----------------+-----------------+-------------------------------+
   | 3 nodes  | |_|           | 3 MDSs in total | 3+ CSs in total | Access point services run     |
   | in total |               |                 |                 | on three nodes in total.      |
   +----------+---------------+-----------------+-----------------+-------------------------------+

.. note:: SSD disks can be assigned **System**, **Metadata**, and **Cache**
   roles at the same time, freeing up more disks for the storage role.

Even though three nodes are recommended for the minimum configuration, you can
start evaluating |product_name| with just one node and add more nodes later.
At the very least, a storage cluster must have one metadata service and one
chunk service running.

A single-node installation will let you evaluate services such as iSCSI, ABGW,
etc. However, such a configuration will have two key limitations:

#. Just one MDS will be a single point of failure. If it fails, the entire
   cluster will stop working.
#. Just one CS will be able to store just one chunk replica. If it fails, the
   data will be lost.

.. important:: If you deploy |product_name| on a single node, you must take
   care of making its storage persistent and redundant to avoid data loss. If
   the node is physical, it must have multiple disks so you can replicate the
   data among them. If the node is a virtual machine, make sure that this VM
   is made highly available by the solution it runs on.

.. note:: Backup Gateway works with the local object storage in the staging
   mode. This means that the data to be replicated, migrated, or uploaded to a
   public cloud is first stored locally and only then sent to the destination.
   It is vital that the local object storage is persistent and redundant so
   that the local data does not get lost. There are multiple ways to ensure
   the persistence and redundancy of the local storage. You can deploy your
   Backup Gateway on multiple nodes and select a good redundancy mode. If your
   gateway is deployed on a single node in |product_name|, you can make its
   storage redundant by replicating it among multiple local disks. If your
   entire |product_name| installation is deployed in a single virtual machine
   with the sole purpose of creating a gateway, make sure this VM is made
   highly available by the solution it runs on.

.. _Recommended Storage Configuration:

Recommended Storage Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is recommended to have at least five metadata services to ensure that the
cluster can survive simultaneous failure of two nodes without data loss. The
following configuration will help you create clusters for production
environments:

.. tabularcolumns:: |>{\TL}\X{3}{20}%
                    |>{\TL}\X{4}{20}%
                    |>{\TL}\X{4}{20}%
                    |>{\TL}\X{4}{20}%
                    |>{\TL}\X{5}{20}|

.. _Recom cluster configuration:

.. table:: Recommended cluster configuration
   :class: longtable

   +----------+---------------+----------------+-----------------+-------------------------+
   | Node #   | 1st disk role | 2nd disk role  | 3rd+ disk roles | Access points           |
   +==========+===============+================+=================+=========================+
   | Nodes    | System        | SSD; metadata, | Storage         | iSCSI, S3 private,      |
   | 1 to 5   |               | cache          |                 | S3 public, ABGW         |
   +----------+---------------+----------------+-----------------+-------------------------+
   | Nodes 6+ | System        | SSD; cache     | Storage         | iSCSI, S3 private, ABGW |
   +----------+---------------+----------------+-----------------+-------------------------+
   | 5+ nodes | |_|           | 5 MDSs         | 5+ CSs in total | All nodes run required  |
   | in total |               | in total       |                 | access points.          |
   +----------+---------------+----------------+-----------------+-------------------------+

A production-ready cluster can be created from just five nodes with the
recommended hardware. However, it is recommended to enter production with at
least ten nodes if you are aiming to achieve significant performance
advantages over direct-attached storage (DAS) or improved recovery times.

The following are more specific configuration examples that can be used in
production. Each configuration can be extended by adding chunk servers and
nodes.

HDD Only
********

This basic configuration requires a dedicated disk for each metadata server.

.. tabularcolumns:: |>{\TL}\X{2}{20}%
                    |>{\TL}\X{3}{20}%
                    |>{\TL}\X{3}{20}%
                    ||>{\TL}\X{2}{20}%
                    |>{\TL}\X{3}{20}%
                    |>{\TL}\X{3}{20}|

.. _HDD only config:

.. table:: HDD only configuration
   :class: longtable

   ====== ========= ========== ====== ========= ==========
   Nodes 1-5 (base)            Nodes 6+ (extension)
   --------------------------- ---------------------------
   Disk # Disk type Disk roles Disk # Disk type Disk roles
   ====== ========= ========== ====== ========= ==========
   1      HDD       System     1      HDD       System
   2      HDD       MDS        2      HDD       CS
   3      HDD       CS         3      HDD       CS
   ...    ...       ...        ...    ...       ...
   N      HDD       CS         N      HDD       CS
   ====== ========= ========== ====== ========= ==========

HDD + System SSD (No Cache)
***************************

This configuration is good for creating capacity-oriented clusters.

.. tabularcolumns:: |>{\TL}\X{2}{20}%
                    |>{\TL}\X{3}{20}%
                    |>{\TL}\X{3}{20}%
                    ||>{\TL}\X{2}{20}%
                    |>{\TL}\X{3}{20}%
                    |>{\TL}\X{3}{20}|

.. _HDD + system SSD (no cache) config:

.. table:: HDD + system SSD (no cache) configuration
   :class: longtable

   ====== ========= =========== ====== ========= ==========
   Nodes 1-5 (base)             Nodes 6+ (extension)
   ---------------------------- ---------------------------
   Disk # Disk type Disk roles  Disk # Disk type Disk roles
   ====== ========= =========== ====== ========= ==========
   1      SSD       System, MDS 1      SSD       System
   2      HDD       CS          2      HDD       CS
   3      HDD       CS          3      HDD       CS
   ...    ...       ...         ...    ...       ...
   N      HDD       CS          N      HDD       CS
   ====== ========= =========== ====== ========= ==========

HDD + SSD
*********

This configuration is good for creating performance-oriented clusters.

.. tabularcolumns:: |>{\TL}\X{2}{20}%
                    |>{\TL}\X{3}{20}%
                    |>{\TL}\X{3}{20}%
                    ||>{\TL}\X{2}{20}%
                    |>{\TL}\X{3}{20}%
                    |>{\TL}\X{3}{20}|

.. _HDD + SSD config:

.. table:: HDD + SSD configuration
   :class: longtable

   ====== ========= ========== ====== ========= ==========
   Nodes 1-5 (base)            Nodes 6+ (extension)
   --------------------------- ---------------------------
   Disk # Disk type Disk roles Disk # Disk type Disk roles
   ====== ========= ========== ====== ========= ==========
   1      HDD       System     1      HDD       System
   2      SSD       MDS, cache 2      SSD       Cache
   3      HDD       CS         3      HDD       CS
   ...    ...       ...        ...    ...       ...
   N      HDD       CS         N      HDD       CS
   ====== ========= ========== ====== ========= ==========

SSD Only
********

This configuration does not require SSDs for cache. When choosing hardware for
this configuration, keep in mind the following:

- Each |product_name| client will be able to obtain up to about 40K
  sustainable IOPS (read + write) from the cluster.
- If you use the erasure coding redundancy scheme, each erasure coding file,
  e.g., a single VM HDD disk, will get up to 2K sustainable IOPS. That is, a
  user working inside a VM will have up to 2K sustainable IOPS per virtual HDD
  at their disposal. Multiple VMs on a node can utilize more IOPS, up to the
  client's limit (see the illustration after this list).
- In this configuration, network latency defines more than half of overall
  performance, so make sure that the latency is minimal. One recommendation is
  to have one 10 Gbps switch between any two nodes in the cluster.
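
Purely as an illustration of how the two IOPS figures above combine (these are
the approximate numbers quoted in the list, not guarantees), a single client
node could drive roughly::

   40,000 IOPS per client / 2,000 IOPS per EC file ~= 20 virtual disks at full speed

Beyond that point, busy erasure-coded virtual disks on the same node share the
client's overall IOPS budget.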

.. tabularcolumns:: |>{\TL}\X{2}{20}%
                    |>{\TL}\X{3}{20}%
                    |>{\TL}\X{3}{20}%
                    ||>{\TL}\X{2}{20}%
                    |>{\TL}\X{3}{20}%
                    |>{\TL}\X{3}{20}|

.. _SSD only config:

.. table:: SSD only configuration
   :class: longtable

   ====== ========= =========== ====== ========= ==========
   Nodes 1-5 (base)             Nodes 6+ (extension)
   ---------------------------- ---------------------------
   Disk # Disk type Disk roles  Disk # Disk type Disk roles
   ====== ========= =========== ====== ========= ==========
   1      SSD       System, MDS 1      SSD       System
   2      SSD       CS          2      SSD       CS
   3      SSD       CS          3      SSD       CS
   ...    ...       ...         ...    ...       ...
   N      SSD       CS          N      SSD       CS
   ====== ========= =========== ====== ========= ==========

HDD + SSD (No Cache), 2 Tiers
*****************************

In this configuration example, tier 1 is for HDDs without cache and tier 2 is
for SSDs. Tier 1 can store cold data (e.g., backups), tier 2 can store hot
data (e.g., high-performance virtual machines).

.. tabularcolumns:: |>{\TL}\X{2}{22}%
                    |>{\TL}\X{3}{22}%
                    |>{\TL}\X{3}{22}%
                    |>{\TL}\X{3}{22}%
                    ||>{\TL}\X{2}{22}%
                    |>{\TL}\X{3}{22}%
                    |>{\TL}\X{3}{22}%
                    |>{\TL}\X{3}{22}|

.. _HDD + SSD (no cache) 2-tier config:

.. table:: HDD + SSD (no cache) 2-tier configuration
   :class: longtable

   ====== ========= =========== ======= ====== ========= ========== ====
   Nodes 1-5 (base)                     Nodes 6+ (extension)
   ------------------------------------ --------------------------------
   Disk # Disk type Disk roles  Tier    Disk # Disk type Disk roles Tier
   ====== ========= =========== ======= ====== ========= ========== ====
   1      SSD       System, MDS         1      SSD       System
   2      SSD       CS          2       2      SSD       CS         2
   3      HDD       CS          1       3      HDD       CS         1
   ...    ...       ...         ...     ...    ...       ...        ...
   N      HDD/SSD   CS          1/2     N      HDD/SSD   CS         1/2
   ====== ========= =========== ======= ====== ========= ========== ====

HDD + SSD, 3 Tiers
******************

In this configuration example, tier 1 is for HDDs without cache, tier 2 is for
HDDs with cache, and tier 3 is for SSDs. Tier 1 can store cold data (e.g.,
backups), tier 2 can store regular virtual machines, and tier 3 can store
high-performance virtual machines.

.. tabularcolumns:: |>{\TL}\X{2}{22}%
                    |>{\TL}\X{3}{22}%
                    |>{\TL}\X{3}{22}%
                    |>{\TL}\X{3}{22}%
                    ||>{\TL}\X{2}{22}%
                    |>{\TL}\X{3}{22}%
                    |>{\TL}\X{3}{22}%
                    |>{\TL}\X{3}{22}|

.. _HDD + SSD 3-tier config:

.. table:: HDD + SSD 3-tier configuration
   :class: longtable

   ====== ========= ============= ======= ====== ========= ========== =====
   Nodes 1-5 (base)                       Nodes 6+ (extension)
   -------------------------------------- ---------------------------------
   Disk # Disk type Disk roles    Tier    Disk # Disk type Disk roles Tier
   ====== ========= ============= ======= ====== ========= ========== =====
   1      HDD/SSD   System                1      HDD/SSD   System
   2      SSD       MDS, T2 cache         2      SSD       T2 cache
   3      HDD       CS            1       3      HDD       CS         1
   4      HDD       CS            2       4      HDD       CS         2
   5      SSD       CS            3       5      SSD       CS         3
   ...    ...       ...           ...     ...    ...       ...        ...
   N      HDD/SSD   CS            1/2/3   N      HDD/SSD   CS         1/2/3
   ====== ========= ============= ======= ====== ========= ========== =====

.. _Raw Disk Space Considerations:

Raw Disk Space Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When planning the infrastructure, keep in mind the following to avoid
confusion:

- The capacity of HDDs and SSDs is measured and specified with decimal, not
  binary prefixes, so "TB" in disk specifications usually means "terabyte".
  The operating system, however, displays drive capacity using binary
  prefixes, meaning that its "TB" is actually "tebibyte", a noticeably larger
  unit. As a result, disks may show capacity smaller than the one marketed by
  the vendor. For example, a disk with 6 TB in specifications may be shown to
  have 5.45 TB of actual disk space in |product_name|.
- 5% of disk space is reserved for emergency needs. Therefore, if you add a
  6 TB disk to a cluster, the available physical space should increase by
  about 5.2 TB (see the calculation after this list).
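
The arithmetic behind the 6 TB example in the list above is as follows::

   Marketed capacity:    6 TB = 6 x 10^12 bytes
   Displayed capacity:   6 x 10^12 / 2^40 ~= 5.45 TB (tebibytes)
   Emergency reserve:    5%
   Added usable space:   5.45 x 0.95 ~= 5.2 TB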

.. _Checking Disk Data Flushing Capabilities:

Checking Disk Data Flushing Capabilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is highly recommended to make sure that all storage devices you plan to
include in your cluster can flush data from cache to disk if power goes out
unexpectedly. This way, you will identify devices that may lose data in case
of a power failure.

|product_name| ships with the ``vstorage-hwflush-check`` tool that checks how
a storage device flushes data to disk in emergencies. The tool is implemented
as a client/server utility:

- The client continuously writes blocks of data to the storage device. When a
  data block is written, the client increases a special counter and sends it
  to the server, which keeps it.
- The server keeps track of the counters incoming from the client and always
  knows the next counter number. If the server receives a counter smaller than
  the one it has (e.g., because the power has failed and the storage device
  has not flushed the cached data to disk), the server reports an error.

To check that a storage device can successfully flush data to disk when power
fails, follow the procedure below:

#. Install the tool from the ``vstorage-ctl`` package available in the
   official repository. For example::

      # wget http://repo.virtuozzo.com/hci/releases/3.0/x86_64/os/Packages/v/\
      vstorage-ctl-7.9.198-1.vl7.x86_64.rpm
      # yum install vstorage-ctl-7.9.198-1.vl7.x86_64.rpm

   Do this on all the nodes involved in tests.

#. On one node, run the server::

      # vstorage-hwflush-check -l

#. On a different node that hosts the storage device you want to test, run the
   client, for example::

      # vstorage-hwflush-check -s vstorage1.example.com -d /vstorage/stor1-ssd/test -t 50

   where

   - ``vstorage1.example.com`` is the host name of the server.
   - ``/vstorage/stor1-ssd/test`` is the directory to use for data flushing
     tests. During execution, the client creates a file in this directory and
     writes data blocks to it.
   - ``50`` is the number of threads for the client to write data to disk.
     Each thread has its own file and counter. You can increase the number of
     threads (max. 200) to test your system in more stressful conditions.

   You can also specify other options when running the client. For more
   information on available options, see the ``vstorage-hwflush-check`` man
   page.

#. Wait for at least 10-15 seconds, cut power from the client node (either
   press the **Power** button or pull the power cord out), and then power it
   on again.

#. Restart the client::

      # vstorage-hwflush-check -s vstorage1.example.com -d /vstorage/stor1-ssd/test -t 50

   Once launched, the client will read all previously written data, determine
   the version of data on the disk, and restart the test from the last valid
   counter. It will then send this valid counter to the server, and the server
   will compare it with the latest counter it has. You may see output like::

      id<N>: <counter_on_disk> -> <counter_on_server>

   which means one of the following:

   - If the counter on the disk is lower than the counter on the server, the
     storage device has failed to flush the data to the disk. Avoid using this
     storage device in production, especially for CS or journals, as you risk
     losing data.
   - If the counter on the disk is higher than the counter on the server, the
     storage device has flushed the data to the disk but the client has failed
     to report it to the server. The network may be too slow or the storage
     device may be too fast for the set number of load threads, so consider
     increasing the number of threads. This storage device can be used in
     production.
   - If both counters are equal, the storage device has flushed the data to
     the disk and the client has reported it to the server. This storage
     device can be used in production.

To be on the safe side, repeat the procedure several times. Once you have
checked your first storage device, continue with all the remaining devices you
plan to use in the cluster.

In particular, test all SSD disks used for CS journaling, as well as the disks
used for MDS journals and chunk servers.