Tigase server monitoring


Artur Hefczyc TigaseTeam
Added almost 9 years ago

This is the most recent addition to the Tigase server implementation: automatic self-monitoring tools inside the server. They are exceptionally useful for advanced users and for those who manage Tigase installations running under high load.

The server has always exposed statistics through ad-hoc commands, which provide a lot of detail about the server's internal state, load, performance and so on. The disadvantage was that you had to check the statistics periodically and actively monitor the server in order to detect that something might be going wrong.

Alternatively, you could write a bot that pulls the statistics periodically, analyzes them and sends a notification if something goes wrong. This removes the need to check the statistics manually, but monitoring done this way is still limited.

To overcome these limitations and to expand the ways in which a Tigase installation can be monitored, we have added automatic monitoring tools to the server. They allow early detection of problems, including the case where the server is under high load and can't keep up with processing all the data.

The monitoring system consists of 2 parts:

  1. Monitor Framework and API - which provides the foundation for the monitoring system and allows interaction between a user and the monitoring system.
  2. Resource Monitors - which are small plugins, each responsible for monitoring one particular resource. This is a Unix-like approach: small and simple components that are each responsible for just one thing.

The monitoring system allows for 2 types of interactions:

  1. Automatic mode - all monitor plugins track resource usage, and when one of them detects that a specified threshold has been reached it can send a notification alert to a specified set of JIDs.
  2. Interactive mode - the monitoring framework as a whole and each monitoring plugin can respond to interactive commands. In most cases these commands retrieve resource usage on demand or adjust monitoring parameters.
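
The two modes can be illustrated with a minimal sketch in Java. It is not the Tigase monitor API: the class name `ThresholdMonitor`, the methods `check` and `command`, and the command words `get` and `set-threshold` are all hypothetical, chosen only for this example. The automatic part raises one alert when a reading has stayed at or above its threshold for a given hold time, and the interactive part returns the current reading or adjusts the threshold.

```java
import java.util.function.DoubleSupplier;

// Hypothetical sketch; names and behaviour are assumptions, not Tigase code.
class ThresholdMonitor {
    private final String name;
    private final DoubleSupplier reading; // supplies the current resource usage
    private double threshold;             // alert level, adjustable at runtime
    private final long holdMillis;        // how long usage must stay high
    private long aboveSince = -1;         // -1 means "currently below threshold"
    private boolean alerted = false;      // avoids repeating the same alert

    ThresholdMonitor(String name, DoubleSupplier reading,
                     double threshold, long holdMillis) {
        this.name = name;
        this.reading = reading;
        this.threshold = threshold;
        this.holdMillis = holdMillis;
    }

    // Automatic mode: called periodically by some scheduler.
    // Returns the alert text to send to the configured JIDs, or null.
    synchronized String check(long nowMillis) {
        double value = reading.getAsDouble();
        if (value < threshold) {
            aboveSince = -1;
            alerted = false;
            return null;
        }
        if (aboveSince < 0) {
            aboveSince = nowMillis;
        }
        if (!alerted && nowMillis - aboveSince >= holdMillis) {
            alerted = true;
            return name + " at " + value + ", threshold " + threshold;
        }
        return null;
    }

    // Interactive mode: answers a simple text command.
    synchronized String command(String cmd) {
        String[] parts = cmd.trim().split("\\s+");
        if (parts[0].equals("get")) {
            return name + "=" + reading.getAsDouble();
        }
        if (parts[0].equals("set-threshold") && parts.length == 2) {
            threshold = Double.parseDouble(parts[1]);
            return "threshold=" + threshold;
        }
        return "unknown command";
    }
}
```

In a real JVM the reading could come, for example, from `ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage()`, which returns a negative value on platforms where the load average is not available. How alerts are actually delivered to JIDs is left out of the sketch.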

At the moment we have 4 working resource monitors:

  1. CPU Monitor - it tracks CPU utilization. If utilization stays above a certain level for a specified amount of time, a notification alert is sent to the specified JIDs. It checks the percentage of CPU used by the Tigase server as well as the system CPU load average.
  2. Memory Monitor - it tracks utilization of all memory types in the Tigase server. This follows from how the JVM uses memory: Java applications use 2 main types of memory, heap and non-heap, which are further divided into separate memory pools. The monitor tracks usage of all of them and sends a notification if the threshold is reached for any memory pool used by the JVM.
  3. HDD Monitor - the Tigase server itself uses the disk only for logging, which in theory can be switched off, but it also uses a database. In most cases disk usage stays quite low, but an early notification that the server is running out of disk space can be very useful. The monitor tracks usage of all partitions/disks detected on the system.
  4. Log Monitor - this one is quite interesting. It keeps track of all log records created by the server components. It keeps the last N records in memory, and if it catches any WARNING or SEVERE log entry it sends a notification to the specified list of JIDs with all log entries in the buffer. This monitor is exceptionally useful for several reasons. Heavy logging usually affects server performance, and when the server is under high load most logs become useless because of the sheer amount of data in them. As logs are usually recycled, useful information may disappear before we can start looking for it. This monitor looks for critical entries and sends all log entries directly preceding the event, so it is much easier to find out what led to the exceptional case. Moreover, this logger is very cheap in terms of resource usage. It practically eliminates the need for log files.
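
The Log Monitor idea (keep the last N records, and on a WARNING or SEVERE entry send the whole buffer) can be sketched as a `java.util.logging` handler. This is only an illustration under stated assumptions, not the Tigase implementation: the class name `BufferingLogMonitor` and the `notifier` callback are hypothetical, and clearing the buffer after each notification is a design choice made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;

// Hypothetical sketch; names and behaviour are assumptions, not Tigase code.
class BufferingLogMonitor extends Handler {
    private final int capacity;                    // N: records kept in memory
    private final Deque<LogRecord> buffer = new ArrayDeque<>();
    private final Consumer<List<String>> notifier; // stands in for "send to JIDs"

    BufferingLogMonitor(int capacity, Consumer<List<String>> notifier) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
        this.notifier = notifier;
    }

    @Override
    public synchronized void publish(LogRecord record) {
        if (record == null) {
            return;
        }
        // Drop the oldest record once the buffer is full.
        if (buffer.size() == capacity) {
            buffer.removeFirst();
        }
        buffer.addLast(record);
        // WARNING and SEVERE trigger a notification with the buffered context.
        if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
            List<String> lines = new ArrayList<>();
            for (LogRecord r : buffer) {
                lines.add(r.getLevel() + " " + r.getMessage());
            }
            buffer.clear(); // assumption: start fresh after each alert
            notifier.accept(lines);
        }
    }

    @Override
    public void flush() {
        // Nothing to flush; records live only in memory.
    }

    @Override
    public void close() {
        buffer.clear();
    }
}
```

Such a handler could be attached with `Logger.getLogger("").addHandler(monitor)` so that it sees records passed to the root logger. Because only a fixed number of records is held in memory and nothing is written to disk, the cost per log call stays small.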

Access to the server monitoring system is restricted and can be granted on a per-JID basis.

Good monitoring tools make it much easier to discover a problem, and sometimes they can warn the administrator before the problem even occurs, so that action can be taken to prevent it.

If you have any ideas about what else the automatic resource monitors could track, we are happy to hear your suggestions and include more features in the next release.