meaning of jmx property - total queues wait
We run Tigase v7.1 in a cluster, with some custom plugins. We also have an implementation of PacketFilterIfc registered as a Session Manager outgoing filter. We know this introduces delays, and we see the numbers fluctuate between 0 and 125 in the following JMX metrics:
total/Total queues wait
sess-man/Total queues wait
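For context on how we read these values: we pull them over JMX. The sketch below is just a generic MBean listing against the local JVM (the actual Tigase statistics bean and attribute names are not shown here; a remote node would need a JMXConnectorFactory connection instead):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxList {
    public static void main(String[] args) throws Exception {
        // Query the local platform MBean server. Against a Tigase node you
        // would connect remotely and then look up the statistics bean and
        // read attributes such as the "Total queues wait" counters.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (ObjectName name : server.queryNames(null, null)) {
            System.out.println(name);
        }
    }
}
```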
I assume 'total/Total queues wait' is the sum of the individual Tigase components' 'Total queues wait' values.
I'm attaching a graph of total/Total queues wait from one of the nodes in the cluster, covering both the period with the issue and normal operation.
The questions I have are these:
1.) We see that number rise, almost certainly due to the delay introduced by the filter, but why does it flatten out at a certain point? In this case it plateaus around 40k, which is far less than the sess-man/Max queue size of 644096.
2.) If you notice, there's another spike between 8:00am and 10:00am. At that point the number grows past 60k but soon comes back down. We know network delays were introduced in our datacenter during those hours, yet even after recovery the minimum continues to hover around 40k. We are trying to figure out what this 40k means. Why doesn't this metric reset?
3.) Are these lost messages, stuck messages, or something else? If they are stuck messages, are they still occupying the queue? Can they be flushed or retried for processing?
4.) We would have assumed these are messages accumulating in the queues because they are processed more slowly than they arrive. But if that were true, the 40k number should drop around midnight, when we don't have many messages coming in.
We are moving away from the delay-introducing Session Manager PacketFilter by doing all such processing in a new Component, but we still want to understand what these numbers mean.
Let me know if you need more info to answer the questions.
Added by Artur Hefczyc 5 months ago
It is really hard to tell what is going on. It does look strange indeed.
It looks like your understanding of the queues and how they work on the Tigase side is correct.
Normally, "queues wait" should empty over time. It is normal for it to go up and down, and at peak times it may go up and stay high for a while, but eventually it should come back down.
If the system cannot cope with the load for a long time, the queue should grow to its maximum and then start to overflow, and the counter stats showing packets lost (queue overflow) should start to grow.
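The overflow behaviour can be sketched with a plain bounded queue: once it is full, further packets are rejected and counted as lost. This is an illustration only, with a toy capacity, not Tigase's actual queue implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;

public class QueueOverflowSketch {
    public static void main(String[] args) {
        // Hypothetical stand-in for one component queue; the real
        // sess-man queues are far larger (e.g. max size 644096).
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(5);
        int dropped = 0; // stand-in for the "packets lost" counter

        // Producer outpaces the (absent) consumer: once the queue is
        // full, offer() returns false and the packet counts as lost.
        for (int packet = 0; packet < 8; packet++) {
            if (!queue.offer(packet)) {
                dropped++;
            }
        }
        System.out.println("waiting=" + queue.size() + " dropped=" + dropped);
        // -> waiting=5 dropped=3
    }
}
```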
On your system the situation is very strange and certainly needs more investigation. The counter sess-man/Total queues wait shows the sum of all SM queues inside the component. It would be good to see exactly which queue it is.
One idea which comes to my mind: each queue has one or more threads dedicated to processing packets from that particular queue. If those threads are either stuck in some processing or broken/terminated, the queue will never be emptied. It would stay full, always at the same number, and additional packets arriving would be lost; there is a separate metric for that.
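That scenario can be sketched as follows, again with a toy bounded queue standing in for a component queue: the consumer thread terminates, the queue fills to capacity and stays pinned there (a flat "queues wait" line), and every later packet is rejected:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;

public class StuckConsumerSketch {
    public static void main(String[] args) throws Exception {
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);
        CountDownLatch died = new CountDownLatch(1);

        // Consumer that handles two packets and then silently terminates,
        // mimicking a processing thread that broke on an error.
        Thread consumer = new Thread(() -> {
            try {
                queue.take();
                queue.take();
            } catch (InterruptedException ignored) {
            } finally {
                died.countDown(); // the thread is gone from here on
            }
        });
        consumer.start();

        queue.put(0);
        queue.put(1);
        died.await(); // consumer processed both packets and terminated

        // With no live consumer the queue fills to capacity and stays
        // there; every further packet is "lost" (offer() fails).
        int lost = 0;
        for (int packet = 2; packet < 10; packet++) {
            if (!queue.offer(packet)) {
                lost++;
            }
        }
        System.out.println("waiting=" + queue.size() + " lost=" + lost);
        // -> waiting=4 lost=4
    }
}
```

A thread dump from the affected node would show whether the real processing threads are alive, blocked, or missing entirely.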