Project

General

Profile

Tigase stops to responding to room history query

Matthew M
Added almost 5 years ago

We are using Tigase 5.2.1 final version. We found a strange problem that's hard to produce predictably but it starts to appear often.

After running Tigase for a day or two, without heavy load, maybe 20-40 concurrent online users, all of sudden, all the client query to get

room history failed -- tigase never respond to those queries, and we did not see any error logs.

We are still trying hard to reproduce it consistently, so far we did not find any clue yet. Would you give us some suggestions that which part of

the code we should look into, or what kind of logs we should pay attention?

It's strange that all other parts of Tigase works fine at the same time, login, presence, even 1-1 chat history query... How does MUC history works different from other part, such as 1-1 history, and this "component" stops responding?


Replies (10)

Avatar?id=6023&size=32x32

Added by Artur Hefczyc TigaseTeam almost 5 years ago

Matthew M wrote:

After running Tigase for a day or two, without heavy load, maybe 20-40 concurrent online users, all of sudden, all the client query to get

room history failed -- tigase never respond to those queries, and we did not see any error logs.

We are still trying hard to reproduce it consistently, so far we did not find any clue yet. Would you give us some suggestions that which part of

the code we should look into, or what kind of logs we should pay attention?

Any exception or WARNING in the log file might be an indication that something went wrong, also if you can get Tigase statistics then you could look at queues sizes for components, MUC in particular.

It's strange that all other parts of Tigase works fine at the same time, login, presence, even 1-1 chat history query... How does MUC history works different from other part, such as 1-1 history, and this "component" stops responding?

It is actually not that strange if you realize that components in Tigase are like a little almost independent programs with own, separate threads pools, own packet queuing system, etc... By design you can deploy a component either as an internal Tigase component or as an external components. So they are pretty independent and this also lowers impact of malfunctioning one component on the rest of the system.

Added by Matthew M almost 5 years ago

Checking tigase statistics is a good idea! Could you share

  1. how to check the various queue statistics (especially related to MUC)?

  2. which Java classes are managing the queues? Is there any parameter we can tune, or we should watch? I am curious how Tigase manages the queue...

  3. what if a malfunctioned client starts to send tons of IQ requests (e.g, MUC history)? Does Tigase has any mechanism to manage it?

Thanks!

Avatar?id=6023&size=32x32

Added by Artur Hefczyc TigaseTeam almost 5 years ago

Matthew M wrote:

Checking tigase statistics is a good idea! Could you share

  1. how to check the various queue statistics (especially related to MUC)?

You can get them either via XMPP using a good client supporting service disco and ad-hoc commands or JMX using either a command line tool or JConsole or something like this. There are other topics and articles on Tigase website with details.

  1. which Java classes are managing the queues? Is there any parameter we can tune, or we should watch? I am curious how Tigase manages the queue...

It is too early to tune if we do not know what is going on. We do not even know if the queues are filling or if this is something else.

  1. what if a malfunctioned client starts to send tons of IQ requests (e.g, MUC history)? Does Tigase has any mechanism to manage it?

Tigase has automatic traffic throttling on per-connection basis. So it should generally not cause much problems.

Added by Matthew M almost 5 years ago

I used JMX monitor via HTTP, and found a few items related to "queues" in this URI: ViewObjectRes//tigase.stats%3Atype%3DStatisticsProvider ,

such as

  • CLIOQueueSize

  • CLQueueSize

  • QueueOverflow

  • QueueSize

  • SMQueueSize

Do you know which one contains packets related to MUC? Does QueueSize contains the packets in all all the queues?

Avatar?id=6023&size=32x32

Added by Artur Hefczyc TigaseTeam almost 5 years ago

The HTTP interface does not show all metrics. So I recommend to use either JMX port to retrieve all metrics or XMPP connection (with Psi client for example) to retrieve statistics.

Each component has own queues, so the fist thing would be looking into queues for MUC component but if any of the queue is non-empty this is the place to start investigation.

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Matthew M wrote:

I used JMX monitor via HTTP, and found a few items related to "queues" in this URI: ViewObjectRes//tigase.stats%3Atype%3DStatisticsProvider ,

such as

  • CLIOQueueSize

  • CLQueueSize

  • QueueOverflow

  • QueueSize

  • SMQueueSize

Do you know which one contains packets related to MUC? Does QueueSize contains the packets in all all the queues?

You can get all statistics over HTTP using different URI (at the bottom of your link you have getComponentStats:

/InvokeAction//tigase.stats%3Atype%3DStatisticsProvider/action=getComponentStats?action=getComponentStats&compName%2Bjava.lang.String=muc&level%2Bint=0

There are two parameters - component name (muc) and lever (0 - all); from there you can get list of all statistics regarding desired component.

Added by Matthew M almost 5 years ago

We guessed there might due to message dropping, as there is a log statement in AbstractMessageReceiver.java:

                log.log(Level.FINEST, "Packet dropped for unknown reason: {0}", packet);

I wonder is there any reason this is logged a FINEST level? Is there any warning or error logs when packets are dropped? We set our log to INFO level and could not spot any warnings or errors.

Avatar?id=6023&size=32x32

Added by Artur Hefczyc TigaseTeam almost 5 years ago

This line is highly unlikely to be executed at all. I have never seen the counter which is incremented here to show anything except 0. There is a very slim chance why it may happen according to Java API but it could only happen when Tigase is under extremely high load and component waits in queue.put for a long time.

And the reason for the FINEST level is just because it may only happen under extremely high load and potentially under a high load it may happen often, very often. Then printing a high volume of debug messages to log file would only make things worse as heavy logging it slows down the server. The counter is incremented to the occurrence of the event can be detected without the log entry.

Added by Matthew M almost 5 years ago

It makes a lot of sense not to print excessive logs to avoid the extra load. We are thinking to print the warning log ONLY once to avoid excessive logs but still gives a heads in error logs.

The counter is incremented to the occurrence of the event can be detected without the log entry.

What's the best way to detect it? Via JMX?

Avatar?id=6023&size=32x32

Added by Artur Hefczyc TigaseTeam almost 5 years ago

Yes, JMX is the way, there is about thousand Tigase metrics which can be retrieved over JMX or XMPP.

    (1-10/10)