.. _Monitoring Event Logs:

Monitoring Event Logs
---------------------

You can use the ``vstorage -c <cluster_name> top`` command to monitor significant events in the storage cluster. For example:

.. image:: /images/image029.png
   :align: center
   :class: align-center

The example above shows the latest events in the ``stor1`` cluster. Events (highlighted in red) are listed in a table with the following columns:

.. raw:: latex

   \setlist[description]{leftmargin=!,labelindent=0pt,labelwidth=1em+\widthof{MESSAGE}}

**TIME**
   Time when the event occurred.

**SYS**
   Cluster component where the event occurred (e.g., MDS for an MDS server or JRN for the local journal).

**SEV**
   Event severity (e.g., info, warn, or err).

**MESSAGE**
   Event description.

.. raw:: latex

   \setlist[description]{leftmargin=!,labelindent=0pt,labelwidth=1em+\widthof{ }}
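If you copy event rows from the ``vstorage top`` screen as plain text, you can split each row into the four columns described above. The following Python snippet is a minimal sketch under an assumed row layout: TIME is taken to be two whitespace-separated tokens (date and time of day), SYS and SEV are single tokens, and the rest of the row is MESSAGE. This layout, the ``Event`` class, and the ``parse_event`` and ``needs_attention`` helpers are hypothetical and are not part of the ``vstorage`` tools; compare against the actual output of your version before relying on them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """One event row: the TIME, SYS, SEV, and MESSAGE columns."""
    time: str
    sys: str
    sev: str
    message: str


def parse_event(row: str) -> Optional[Event]:
    """Split one captured event row into its four columns.

    Hypothetical layout (an assumption, not a documented format):
    "<date> <time-of-day> <SYS> <SEV> <MESSAGE ...>"
    Returns None for rows without all columns, such as the header line.
    """
    parts = row.split(maxsplit=4)
    if len(parts) < 5:
        return None
    date, clock, sys_name, sev, message = parts
    return Event(time=f"{date} {clock}", sys=sys_name, sev=sev, message=message)


def needs_attention(event: Event) -> bool:
    """Treat warn and err events as actionable (labels assumed as in the table below)."""
    return event.sev.lower() in {"warn", "err"}


if __name__ == "__main__":
    # Made-up sample row in the assumed layout.
    sample = "22-01-15 10:00:00 MDS warn The cluster is filled up to 85%"
    event = parse_event(sample)
    if event is not None and needs_attention(event):
        print(f"[{event.sev}] {event.sys}: {event.message}")
```

Splitting with ``maxsplit=4`` keeps the MESSAGE column intact even though it contains spaces.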

The following table lists the basic events displayed by the ``vstorage top`` command.

.. tabularcolumns:: |>{\TL}\X{2}{6}%
                    |>{\TL}\X{1}{6}%
                    |>{\TL}\X{3}{6}|

.. _Basic events:

.. table:: Basic events
   :class: longtable

   +--------------------------------------+---------------+-----------------------------------------------+
   | Event                                | SYS / SEV     | Description                                   |
   +======================================+===============+===============================================+
   | MDS#<N> (<addr>:<port>) lags         | JRN err       | Generated by the MDS master server when it    |
   | behind for more than 1000 rounds     |               | detects that MDS#<N> is stale.                |
   |                                      |               |                                               |
   |                                      |               | This message may indicate that an MDS server  |
   |                                      |               | is very slow and lags behind.                 |
   +--------------------------------------+---------------+-----------------------------------------------+
   | MDS#<N> (<addr>:<port>) didn't       | JRN err       | Generated by the MDS master server if MDS#<N> |
   | accept commits for *M* sec           |               | did not accept commits for *M* seconds.       |
   |                                      |               | MDS#<N> gets marked as stale.                 |
   |                                      |               |                                               |
   |                                      |               | This message may indicate that the MDS        |
   |                                      |               | service on MDS#<N> is experiencing a problem. |
   |                                      |               | The problem may be critical and should be     |
   |                                      |               | resolved as soon as possible.                 |
   +--------------------------------------+---------------+-----------------------------------------------+
   | MDS#<N> (<addr>:<port>) state is     | JRN err       | Generated by the MDS master server when       |
   | outdated and will do a full resync   |               | MDS#<N> needs a full resync. MDS#<N> gets     |
   |                                      |               | marked as stale.                              |
   |                                      |               |                                               |
   |                                      |               | This message may indicate that an MDS server  |
   |                                      |               | was too slow or disconnected for so long that |
   |                                      |               | it no longer tracks the current metadata      |
   |                                      |               | state and has to be resynchronized. The       |
   |                                      |               | problem may be critical and should be         |
   |                                      |               | resolved as soon as possible.                 |
   +--------------------------------------+---------------+-----------------------------------------------+
   | MDS#<N> at <addr>:<port> became      | JRN info      | Generated every time a new MDS master server  |
   | master                               |               | is elected in the cluster.                    |
   |                                      |               |                                               |
   |                                      |               | Frequent changes of MDS masters may indicate  |
   |                                      |               | poor network connectivity and may affect the  |
   |                                      |               | cluster operation.                            |
   +--------------------------------------+---------------+-----------------------------------------------+
   | The cluster is healthy with *N*      | MDS info      | Generated when the cluster status changes to  |
   | active CS                            |               | healthy or when a new MDS master server is    |
   |                                      |               | elected.                                      |
   |                                      |               |                                               |
   |                                      |               | This message indicates that all chunk servers |
   |                                      |               | in the cluster are active and the number of   |
   |                                      |               | replicas meets the configured cluster         |
   |                                      |               | requirements.                                 |
   +--------------------------------------+---------------+-----------------------------------------------+
   | The cluster is degraded with *N*     | MDS warn      | Generated when the cluster status changes to  |
   | active, *M* inactive, *K* offline CS |               | degraded or when a new MDS master server      |
   |                                      |               | is elected.                                   |
   |                                      |               |                                               |
   |                                      |               | This message indicates that some chunk        |
   |                                      |               | servers in the cluster are                    |
   |                                      |               |                                               |
   |                                      |               | - inactive, i.e., do not send any             |
   |                                      |               |   registration messages, or                   |
   |                                      |               | - offline, i.e., have been inactive for       |
   |                                      |               |   longer than ``mds.wd.offline_tout``,        |
   |                                      |               |   which is 5 minutes by default.              |
   +--------------------------------------+---------------+-----------------------------------------------+
   | The cluster failed with *N* active,  | MDS err       | Generated when the cluster status changes to  |
   | *M* inactive, *K* offline CS         |               | failed or when a new MDS master server is     |
   | (mds.wd.max_offline_cs=<n>)          |               | elected.                                      |
   |                                      |               |                                               |
   |                                      |               | This message indicates that the number of     |
   |                                      |               | offline chunk servers exceeds                 |
   |                                      |               | ``mds.wd.max_offline_cs``, which is 2 by      |
   |                                      |               | default. When the cluster fails, automatic    |
   |                                      |               | replication is no longer scheduled, so the    |
   |                                      |               | cluster administrator must either repair the  |
   |                                      |               | failed chunk servers or increase              |
   |                                      |               | ``mds.wd.max_offline_cs``. Setting this value |
   |                                      |               | to 0 disables the failed mode completely.     |
   +--------------------------------------+---------------+-----------------------------------------------+
   | The cluster is filled up to <N>%     | MDS info/warn | Shows the current space usage in the cluster. |
   |                                      |               | A warning is generated if the disk space      |
   |                                      |               | consumption equals or exceeds 80%.            |
   |                                      |               |                                               |
   |                                      |               | It is important to have spare disk space for  |
   |                                      |               | data replicas if one of the chunk servers     |
   |                                      |               | fails.                                        |
   +--------------------------------------+---------------+-----------------------------------------------+
   | Replication started, *N* chunks are  | MDS info      | Generated when the cluster starts automatic   |
   | queued                               |               | data replication to recover the missing       |
   |                                      |               | replicas.                                     |
   +--------------------------------------+---------------+-----------------------------------------------+
   | Replication completed                | MDS info      | Generated when the cluster finishes automatic |
   |                                      |               | data replication.                             |
   +--------------------------------------+---------------+-----------------------------------------------+
   | CS#<N> has reported hard error       | MDS warn      | Generated when the chunk server CS#<N>        |
   | on *path*                            |               | detects disk data corruption.                 |
   |                                      |               |                                               |
   |                                      |               | It is recommended to replace chunk servers    |
   |                                      |               | that have corrupted disks with new ones as    |
   |                                      |               | soon as possible and check the hardware for   |
   |                                      |               | errors.                                       |
   +--------------------------------------+---------------+-----------------------------------------------+
   | CS#<N> has not registered during     | MDS warn      | Generated when the chunk server CS#<N> has    |
   | the last *T* sec and is marked       |               | been unavailable for a while. In this case,   |
   | as inactive/offline                  |               | the chunk server first gets marked as         |
   |                                      |               | inactive. After ``mds.wd.offline_tout``       |
   |                                      |               | (5 minutes by default), the state changes     |
   |                                      |               | to offline, which starts automatic            |
   |                                      |               | replication of data to restore the replicas   |
   |                                      |               | that were stored on the offline chunk server. |
   +--------------------------------------+---------------+-----------------------------------------------+
   | Failed to allocate *N* replicas for  | MDS warn      | Generated when the cluster cannot allocate    |
   | '*path*' by request from             |               | chunk replicas, for example, when it runs     |
   | <addr>:<port> - *K* out of *M*       |               | out of disk space.                            |
   | chunks servers are available         |               |                                               |
   +--------------------------------------+---------------+-----------------------------------------------+
   | Failed to allocate *N* replicas for  | MDS warn      | Generated when the cluster cannot allocate    |
   | '*path*' by request from             |               | chunk replicas because not enough chunk       |
   | <addr>:<port> since only *K* chunk   |               | servers are registered in the cluster.        |
   | servers are registered               |               |                                               |
   +--------------------------------------+---------------+-----------------------------------------------+
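Several rows of the table above describe thresholds: chunk servers become offline after ``mds.wd.offline_tout`` (5 minutes by default), the cluster fails when more than ``mds.wd.max_offline_cs`` (2 by default) chunk servers are offline, and the space usage event becomes a warning at 80%. The Python snippet below is a simplified sketch of these rules, not the MDS implementation. It assumes the default values above, looks only at chunk server counts (it ignores whether replica counts meet the cluster requirements), and uses hypothetical function names.

```python
from typing import Optional

# Defaults quoted in the table above; your cluster may be configured differently.
OFFLINE_TOUT_SEC = 5 * 60  # mds.wd.offline_tout: inactive longer than this -> offline
MAX_OFFLINE_CS = 2         # mds.wd.max_offline_cs: more offline CS than this -> failed
SPACE_WARN_PERCENT = 80    # "filled up to <N>%" is a warning at or above this value


def cs_state(inactive_for_sec: Optional[float],
             offline_tout_sec: float = OFFLINE_TOUT_SEC) -> str:
    """Classify one chunk server.

    inactive_for_sec is None while the CS keeps registering; otherwise it is
    how long, in seconds, the CS has been inactive.
    """
    if inactive_for_sec is None:
        return "active"
    if inactive_for_sec > offline_tout_sec:
        return "offline"
    return "inactive"


def cluster_status(inactive: int, offline: int,
                   max_offline_cs: int = MAX_OFFLINE_CS) -> str:
    """Derive healthy/degraded/failed from CS counts only (replicas ignored)."""
    # A limit of 0 disables the failed mode completely.
    if max_offline_cs != 0 and offline > max_offline_cs:
        return "failed"
    if inactive > 0 or offline > 0:
        return "degraded"
    return "healthy"


def space_severity(used_percent: float) -> str:
    """Severity of the "The cluster is filled up to <N>%" event."""
    return "warn" if used_percent >= SPACE_WARN_PERCENT else "info"


if __name__ == "__main__":
    # Two registering CSs, one inactive for 90 s, one inactive for 400 s.
    states = [cs_state(None), cs_state(None), cs_state(90), cs_state(400)]
    print(cluster_status(states.count("inactive"), states.count("offline")))  # degraded
```

The sketch checks the failed condition before the degraded one, so a cluster with more offline chunk servers than the limit is reported as failed, while setting ``max_offline_cs`` to 0 leaves it degraded instead.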