Strange error in tigase-console.log

David Leder
Added almost 2 years ago

Hi everyone

I'm running a 4-backend cluster with ~50k simultaneous connections on each node. Occasionally I see the following entry in the tigase-console.log:

stream:error termination

Unfortunately I have no idea what could cause this. I do also have some NotAuthorizedException exceptions and I think they are coming from some race-condition while authorizing through the custom auth plugin.

Can someone tell me where the "stream:error termination" entry comes from and how I might debug it?

Best Regards

Dave


Replies (3)

Added by Artur Hefczyc TigaseTeam almost 2 years ago

What version of the Tigase server do you run? What kind of load, in terms of messages per second, or how many user logins per second do you have? Do you run Tigase with ACS in a cluster mode?

50k connections / node might be a lot or might be not that much, it depends on what kind of machine you use as a server. Is it real HW or a VM?

The kind of errors you see might be caused by timeouts. If you have lots of traffic and many user logins per second, you may experience authentication timeouts, which may result in the errors you see. What kind of DB do you use? Are you sure the DB can handle the load?

Added by David Leder almost 2 years ago

I'm still running 5.2 on that system.

I've got about 2000 User logins/sec (BOSH with reconnect every 30sec) which probably also generates the highest load on everything. The external auth DB should not be a problem, it runs "mysql Ver 14.14 Distrib 5.5.51-38.1, for Linux (x86_64) using readline 5.1" and is a direct slave replication of the main DB (there is no write activity on that node). The tigase user DB is running "mysqld Ver 5.5.30-30.2 for Linux on x86_64 (Percona Server (GPL), Release rel30.2, Revision 509)".

The tigase monitoring plugin always shows 0 for MSG/sec, which is probably not right (I'm using it to synchronize end-user measurements on multiple devices for a single customer). However I cannot state any reliable MSG/sec figure.

Added by Artur Hefczyc TigaseTeam almost 2 years ago

David Leder wrote:

I'm still running 5.2 on that system.

Ok, that's pretty old. You mention something about a custom auth plugin. Is this your custom code? A race condition is a possibility but, given the details below, an unlikely cause of the problem. See below for more comments.

I've got about 2000 User logins/sec (BOSH with reconnect every 30sec) which probably also generates the highest load on everything.

Ok, this might actually cause lots of problems, both performance and resource related. With approx 50k connections per node, you have 50k file descriptors/network connections occupied. However, 2,000 new connections/disconnections per second means that you also have at least 200k additional file descriptors/network connections in use. This is because, by default, a closed connection does not return its resources to the system right away. The closed connection stays in the TIME_WAIT state for a period that depends on the OS configuration, usually somewhere between 60 and 300 seconds.
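As a rough back-of-the-envelope sketch of the arithmetic above (the function names are only illustrative, and the hold times are example values, not measurements from your system): the number of lingering sockets is approximately the close rate multiplied by the time each closed connection is held. The 200k figure corresponds to a hold time of about 100 seconds.

```python
def lingering_sockets(closes_per_sec, hold_secs):
    """Approximate number of closed connections still held by the OS."""
    return closes_per_sec * hold_secs


def total_descriptors(established, closes_per_sec, hold_secs):
    """Established connections plus those still lingering after close."""
    return established + lingering_sockets(closes_per_sec, hold_secs)


if __name__ == "__main__":
    # Example hold times in seconds; the real value depends on the OS configuration.
    for hold in (60, 100, 300):
        print(hold, lingering_sockets(2000, hold), total_descriptors(50000, 2000, hold))
```

This ignores connection reuse and any proxy in front of the server, so treat the result as an order-of-magnitude estimate only.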

So you may have a huge number of connections in the TIME_WAIT state on your server, which may even cause you to run out of available file descriptors or network connections. Depending on how exactly the system is deployed, if you use a proxy or a firewall, this may cause out-of-resources problems on those devices too.
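One way to check this on a node is to count sockets per TCP state. The following is a minimal sketch, assuming a Linux server that exposes /proc/net/tcp (and /proc/net/tcp6); the function name is illustrative, and the state codes are the ones the Linux kernel uses in that file.

```python
import os
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines):
    """Count sockets per TCP state from lines in /proc/net/tcp format."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Data rows start with an index such as "12:"; this skips the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        if os.path.exists(path):
            with open(path) as handle:
                print(path, dict(count_tcp_states(handle)))
```

If the TIME_WAIT count is several times the ESTABLISHED count, the connect/disconnect rate is a likely suspect.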

Even if there are just routers between your clients and the server, these may cause sudden connection drops or connection unavailability at such a high connect/disconnect rate.

It is definitely possible to find out what is causing the problem, but in your case it may require a thorough investigation.
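A possible first step in such an investigation is simply to count how often each entry shows up in tigase-console.log, so the counts can be compared between nodes or between runs. This is only a sketch: the two marker strings are taken from this thread, the function name is illustrative, and the log path has to be passed on the command line because it depends on your installation.

```python
import sys
from collections import Counter

# Entries mentioned in this thread; extend the tuple as needed.
MARKERS = ("stream:error termination", "NotAuthorizedException")


def count_markers(lines, markers=MARKERS):
    """Count how many lines contain each marker string."""
    counts = Counter({marker: 0 for marker in markers})
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_markers.py <path to tigase-console.log>
    with open(sys.argv[1], errors="replace") as handle:
        for marker, count in count_markers(handle).items():
            print(f"{count:8d}  {marker}")
```

If both counts rise and fall together, that would point at a common cause such as timeouts under load rather than two unrelated problems.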

The tigase monitoring plugin always shows 0 for MSG/sec, which is probably not right (I'm using it to synchronize end-user measurements on multiple devices for a single customer). However I cannot state any reliable MSG/sec figure.

This may be a server/monitor version mismatch. I suggest you try to get the statistics from the Tigase server and read the metrics directly from there.
