.. _Monitoring Event Logs:

Monitoring Event Logs
---------------------

You can use the ``vstorage -c top`` utility to monitor significant events happening in the storage cluster. For example:

.. image:: /images/image029.png
   :align: center
   :class: align-center

The command above shows the latest events in the ``stor1`` cluster. The information on events (highlighted in red) is given in a table with the following columns:

.. raw:: latex

   \setlist[description]{leftmargin=!,labelindent=0pt,labelwidth=1em+\widthof{MESSAGE}}

**TIME**
   Time of the event.

**SYS**
   Component of the cluster where the event happened (e.g., MDS for an MDS server or JRN for local journal).

**SEV**
   Event severity.

**MESSAGE**
   Event description.

.. raw:: latex

   \setlist[description]{leftmargin=!,labelindent=0pt,labelwidth=1em+\widthof{ }}

The following table lists basic events displayed when you run the ``vstorage top`` utility.

.. tabularcolumns::
   |>{\TL}\X{2}{6}%
   |>{\TL}\X{1}{6}%
   |>{\TL}\X{3}{6}|

.. _Basic events:

.. table:: Basic events
   :class: longtable

   +--------------------------------------+---------------+-----------------------------------------------+
   | Event                                | Severity      | Description                                   |
   +======================================+===============+===============================================+
   | MDS# (:) lags                        | JRN err       | Generated by the MDS master server when it    |
   | behind for more than 1000 rounds     |               | detects that MDS# is stale.                   |
   |                                      |               |                                               |
   |                                      |               | This message may indicate that some MDS       |
   |                                      |               | server is very slow and lags behind.          |
   +--------------------------------------+---------------+-----------------------------------------------+
   | MDS# (:) didn't                      | JRN err       | Generated by the MDS master server if MDS#    |
   | accept commits for *M* sec           |               | did not accept commits for *M* seconds.       |
   |                                      |               | MDS# gets marked as stale.                    |
   |                                      |               |                                               |
   |                                      |               | This message may indicate that the MDS        |
   |                                      |               | service on MDS# is experiencing a problem.    |
   |                                      |               | The problem may be critical and should be     |
   |                                      |               | resolved as soon as possible.                 |
   +--------------------------------------+---------------+-----------------------------------------------+
   | MDS# (:) state is                    | JRN err       | Generated by the MDS master server when       |
   | outdated and will do a full resync   |               | MDS# will do a full resync. MDS# gets         |
   |                                      |               | marked as stale.                              |
   |                                      |               |                                               |
   |                                      |               | This message may indicate that some MDS       |
   |                                      |               | server was too slow or disconnected for such  |
   |                                      |               | a long time that it is not really managing    |
   |                                      |               | the state of metadata and has to be           |
   |                                      |               | resynchronized. The problem may be critical   |
   |                                      |               | and should be resolved as soon as possible.   |
   +--------------------------------------+---------------+-----------------------------------------------+
   | MDS# at : became                     | JRN info      | Generated every time a new MDS master server  |
   | master                               |               | is elected in the cluster.                    |
   |                                      |               |                                               |
   |                                      |               | Frequent changes of MDS masters may indicate  |
   |                                      |               | poor network connectivity and may affect the  |
   |                                      |               | cluster operation.                            |
   +--------------------------------------+---------------+-----------------------------------------------+
   | The cluster is healthy with *N*      | MDS info      | Generated when the cluster status changes to  |
   | active CS                            |               | healthy or when a new MDS master server is    |
   |                                      |               | elected.                                      |
   |                                      |               |                                               |
   |                                      |               | This message indicates that all chunk servers |
   |                                      |               | in the cluster are active and the number of   |
   |                                      |               | replicas meets the set cluster requirements.  |
   +--------------------------------------+---------------+-----------------------------------------------+
   | The cluster is degraded with *N*     | MDS warn      | Generated when the cluster status changes to  |
   | active, *M* inactive, *K* offline CS |               | degraded or when a new MDS master server      |
   |                                      |               | is elected.                                   |
   |                                      |               |                                               |
   |                                      |               | This message indicates that some chunk        |
   |                                      |               | servers in the cluster are:                   |
   |                                      |               |                                               |
   |                                      |               | - inactive, i.e. do not send any              |
   |                                      |               |   registration messages, or                   |
   |                                      |               | - offline, i.e. have been inactive for        |
   |                                      |               |   longer than ``mds.wd.offline_tout``,        |
   |                                      |               |   which is 5 minutes by default.              |
   +--------------------------------------+---------------+-----------------------------------------------+
   | The cluster failed with *N* active,  | MDS err       | Generated when the cluster status changes to  |
   | *M* inactive, *K* offline CS         |               | failed or when a new MDS master server is     |
   | (mds.wd.max_offline_cs=)             |               | elected.                                      |
   |                                      |               |                                               |
   |                                      |               | This message indicates that the number of     |
   |                                      |               | offline chunk servers exceeds                 |
   |                                      |               | ``mds.wd.max_offline_cs``, which is 2 by      |
   |                                      |               | default. When the cluster fails, the          |
   |                                      |               | automatic replication is no longer scheduled, |
   |                                      |               | so the cluster administrator must take        |
   |                                      |               | action to either repair failed chunk servers  |
   |                                      |               | or increase ``mds.wd.max_offline_cs``.        |
   |                                      |               | Setting this value to 0 disables the failed   |
   |                                      |               | mode completely.                              |
   +--------------------------------------+---------------+-----------------------------------------------+
   | The cluster is filled up to %        | MDS info/warn | Shows the current space usage in the cluster. |
   |                                      |               | A warning is generated if the disk space      |
   |                                      |               | consumption equals or exceeds 80%.            |
   |                                      |               |                                               |
   |                                      |               | It is important to have spare disk space for  |
   |                                      |               | data replicas if one of the chunk servers     |
   |                                      |               | fails.                                        |
   +--------------------------------------+---------------+-----------------------------------------------+
   | Replication started, *N* chunks are  | MDS info      | Generated when the cluster starts automatic   |
   | queued                               |               | data replication to recover the missing       |
   |                                      |               | replicas.                                     |
   +--------------------------------------+---------------+-----------------------------------------------+
   | Replication completed                | MDS info      | Generated when the cluster finishes automatic |
   |                                      |               | data replication.                             |
   +--------------------------------------+---------------+-----------------------------------------------+
   | CS# has reported hard error          | MDS warn      | Generated when the chunk server CS#           |
   | on *path*                            |               | detects disk data corruption.                 |
   |                                      |               |                                               |
   |                                      |               | It is recommended to replace chunk servers    |
   |                                      |               | that have corrupted disks as soon as possible |
   |                                      |               | and to check the hardware for errors.         |
   +--------------------------------------+---------------+-----------------------------------------------+
   | CS# has not registered during        | MDS warn      | Generated when the chunk server CS# has       |
   | the last *T* sec and is marked       |               | been unavailable for a while. In this case,   |
   | as inactive/offline                  |               | the chunk server first gets marked as         |
   |                                      |               | inactive. After 5 minutes, the state is       |
   |                                      |               | changed to offline, which starts automatic    |
   |                                      |               | replication of data to restore the replicas   |
   |                                      |               | that were stored on the offline chunk server. |
   +--------------------------------------+---------------+-----------------------------------------------+
   | Failed to allocate *N* replicas for  | MDS warn      | Generated when the cluster cannot allocate    |
   | '*path*' by request from             |               | chunk replicas, for example, when it runs     |
   | : - *K* out of *M*                   |               | out of disk space.                            |
   | chunks servers are available         |               |                                               |
   +--------------------------------------+---------------+-----------------------------------------------+
   | Failed to allocate *N* replicas for  | MDS warn      | Generated when the cluster cannot allocate    |
   | '*path*' by request from             |               | chunk replicas because not enough chunk       |
   | : since only *K* chunk               |               | servers are registered in the cluster.        |
   | servers are registered               |               |                                               |
   +--------------------------------------+---------------+-----------------------------------------------+
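
The thresholds behind the healthy, degraded, and failed messages, and behind the space usage warning, can be summarized in a short Python sketch. It is an illustration only and not part of ``vstorage``: the function names ``cluster_state`` and ``space_severity`` are hypothetical, the defaults are the ones quoted in the table (``mds.wd.max_offline_cs`` of 2 and a warning at 80% usage), and the replica-count condition for a healthy cluster is deliberately left out.

```python
# Illustrative sketch only; these helpers are hypothetical and not a vstorage API.

MAX_OFFLINE_CS = 2        # default of mds.wd.max_offline_cs, as quoted in the table
SPACE_WARN_PERCENT = 80   # usage at or above this level is reported as a warning


def cluster_state(inactive, offline, max_offline_cs=MAX_OFFLINE_CS):
    """Classify the cluster from the number of inactive and offline chunk servers.

    Assumption: replica counts are ignored, although the table also requires
    them to meet the cluster requirements for the healthy state.
    """
    # A limit of 0 disables the failed mode completely.
    if max_offline_cs > 0 and offline > max_offline_cs:
        return "failed"
    if inactive > 0 or offline > 0:
        return "degraded"
    return "healthy"


def space_severity(used_percent, warn_at=SPACE_WARN_PERCENT):
    """Return the severity of the 'cluster is filled up to' message."""
    return "warn" if used_percent >= warn_at else "info"


if __name__ == "__main__":
    print(cluster_state(inactive=0, offline=0))
    print(cluster_state(inactive=1, offline=1))
    print(cluster_state(inactive=0, offline=3))
    print(cluster_state(inactive=0, offline=3, max_offline_cs=0))
    print(space_severity(80))
```

Under these assumptions, three offline chunk servers put the cluster into the failed state with the default limit of 2, while the same three offline servers only make it degraded when the limit is set to 0.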