
JMX monitoring

HaXkil The
Added almost 5 years ago

Hi.

We have an issue with JMX monitoring, which seems to have appeared after we switched to the latest available Tigase + ACS.

The StatisticsProvider's attributes MessagesNumber and PresencesNumber always report 0,

even though operations like getAllStats report correct non-zero values.

Do you have any idea why this happens on our two-node Tigase setup?


Replies (19)


Added by Artur Hefczyc TigaseTeam almost 5 years ago

There could have been a slight change in some of the metric names. Please check the metrics returned from the getAllStats function and compare the names you use to download individual metrics.
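
For illustration, a minimal sketch of such a comparison over JMX, in Java. The service URL and the tigase.stats:type=StatisticsProvider object name are assumptions here; verify both in a JMX console against your own setup:

import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StatsNameCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint; adjust host/port to your monitoring setup.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9050/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Object name assumed; verify it in your JMX console.
            ObjectName stats = new ObjectName("tigase.stats:type=StatisticsProvider");

            // The individual attribute, as a monitoring tool would read it.
            System.out.println("MessagesNumber attribute = "
                    + conn.getAttribute(stats, "MessagesNumber"));

            // Full dump via getAllStats(level), to compare the metric names
            // against the ones used for individual attributes.
            Object result = conn.invoke(stats, "getAllStats",
                    new Object[] { 800 }, new String[] { "int" });
            if (result instanceof Map) {
                for (Map.Entry<?, ?> e : ((Map<?, ?>) result).entrySet()) {
                    String key = String.valueOf(e.getKey());
                    if (key.toLowerCase().contains("message")) {
                        System.out.println(key + " = " + e.getValue());
                    }
                }
            }
        }
    }
}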

Added by HaXkil The almost 5 years ago

Hi.

I've run into another thing with JMX monitoring. The connection number has increased significantly in comparison with the previous Tigase 5.1.x version.

ConnectionsNumber reported by the StatisticsProvider is constantly growing (up to ~1000), even though no clients are connected to the server.

I've switched from the cluster setup to a single-instance setup with all custom plugins disabled, but I still observe this issue.

Do you think a custom implementation of SASL callbacks or the ACS enhancements may cause this behavior?

I've attached the init.properties I've used.

Thank you in advance.

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Again, in addition to getting the metrics for the ConnectionsNumber attribute, could you also check the number of connections using the getAllStats function and compare the values to bosh/Open connections and c2s/Open connections?

What type of connections (bosh? socket?) do you usually/mostly serve?

Added by HaXkil The almost 5 years ago

Clients are using BOSH and XMPP connections.

In the case I've described in my previous comment, when no clients are connected, the getAllStats function also returns a significant number of connections:

bosh/Open connections=285 and c2s/Open connections=740, while StatisticsProvider:ConnectionsNumber shows ~1000 connections.

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

StatisticsProvider:ConnectionsNumber is calculated from the number of socket and BOSH connections, so it would be correct, and the issue stems from the numbers reported by the connection managers (bosh, socket). It may happen that some clients don't close the connection properly, hence Tigase may 'think' that they are still connected and report them in the statistics.

There is a Watchdog mechanism that periodically checks all connections and cleans up those that are no longer valid, hence after a while the reported number of connections should go down and reflect the correct number of connected users.

Also, the latest nightly should contain some fixes related to discovering connections that no longer exist. Please try it and check whether the statistics are now correct.

Added by HaXkil The almost 5 years ago

My first thought was that clients just don't close connections appropriately, so I ran the server (2 nodes) and checked that nobody was connected.

But the number of open connections reported by JMX started growing.

getAllStats(800) returned something like:

message-router/Local hostname=xxx.host.private
message-router/Uptime=6 mins, 30 sec
message-router/CPU usage=1.2%
message-router/Max Heap mem=1,014,528 KB
message-router/Used Heap=216,533 KB
bosh/Open connections=196
c2s/Open connections=198
cl-comp/Open connections=4
sess-man/Open user connections=2
sess-man/Maximum user connections=2
sess-man/Open user sessions=3
cluster-strat/Connected nodes=1

Even though nobody was connected to that server, the session manager reported 2 open connections. I thought this was just the other node.

I checked that server again a few hours later, and the number of open connections was ~1000.

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Hm, interesting. The server can't spontaneously create connections; something needs to open them. Can you monitor, at the system level, the number of established connections to Tigase? Can you also try to run Tigase with the 'vanilla' configuration? I've tried to reproduce the issue with a similar setup and I simply wasn't able to.

Which version of Tigase are you actually using? Latest stable or latest nightly?

Added by Sam Wright almost 5 years ago

I have also seen similar issues using ws2s/Open connections.

When I do netstat -lpna | grep 5290 | wc -l, which counts the number of connections to the websocket port on one server, I get ~30 open websocket connections. When I look at JMX ws2s/Open connections I see around ~800/server. When I log in to the database to see the number of users created, I only see 1861, but summing ws2s/Open connections across all servers gives me well over 2000 users online.

It would make sense that some of these connections are "old", but it doesn't make sense that the system reports 30 connections currently open while JMX reports 800 open connections.

Any thoughts?

Added by HaXkil The almost 5 years ago

Your thoughts pointed me in the right direction, and I've figured out the reason for the unexpected connections, thank you.

All those connections were created by the load balancer, which continuously pings the Tigase server on ports 5222 and 5280 to check whether the server is alive.

Could you suggest a way to check whether the Tigase server is alive without causing it to create a new connection?

Is there any kind of health check HTTP request?

P.S. We are using the latest nightly build.

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Sam Wright wrote:

I have also seen similar issues using ws2s/Open connections.

What is your usage? Is it under regular load or under a stress/load test?

When I do netstat -lpna | grep 5290 | wc -l, which counts the number of connections to the websocket port on one server, I get ~30 open websocket connections. When I look at JMX ws2s/Open connections I see around ~800/server.

Could you observe how the number of connections behaves over time in relation to connected/disconnected clients? Does the number of connections decrease whenever a websocket connection is gone?

When I log in to the database to see the number of users created, I only see 1861, but summing ws2s/Open connections across all servers gives me well over 2000 users online.

This is a bit confusing: how do you check the number of online users in the database? There are users there, but there is no valid and reliable way to check the number of online users there.

It would make sense that some of these connections are "old", but it doesn't make sense that the system reports 30 connections currently open while JMX reports 800 open connections.

If your system is running under a particular load there is a slim possibility of this happening, but then again, it would be easier to check if you could provide more details about the usage.

HaXkil The wrote:

Your thoughts pointed me in the right direction, and I've figured out the reason for the unexpected connections, thank you.

All those connections were created by the load balancer, which continuously pings the Tigase server on ports 5222 and 5280 to check whether the server is alive.

Could you suggest a way to check whether the Tigase server is alive without causing it to create a new connection?

Is there any kind of health check HTTP request?

If you want to make sure that Tigase is alive, that processing is OK, and that it's possible to establish a connection, then those pings seem like a good idea. Other solutions won't give you the full picture of whether the server correctly responds to requests.

There are internal tools in Tigase which allow subscribing to monitoring events, monitoring resource (CPU, memory) usage, and getting notifications for exceptions, but this approaches the problem from a slightly different angle.

Added by Sam Wright almost 5 years ago

Wojciech Kapcia wrote:

Sam Wright wrote:

I have also seen similar issues using ws2s/Open connections.

What is your usage? Is it under regular load or under a stress/load test?

This is during both zero load and stress-test load (the stress test would hit 5222 for XMPP, not 5290 for WebSockets). I've also determined that a health check is causing the extra connections despite no users connecting to the ports. But Tigase should not be reporting these extra connections.

When I do netstat -lpna | grep 5290 | wc -l, which counts the number of connections to the websocket port on one server, I get ~30 open websocket connections. When I look at JMX ws2s/Open connections I see around ~800/server.

Could you observe how the number of connections behaves over time in relation to connected/disconnected clients? Does the number of connections decrease whenever a websocket connection is gone?

The number of connections does drop off (garbage collection) but with a cluster of

When I log in to the database to see the number of users created, I only see 1861, but summing ws2s/Open connections across all servers gives me well over 2000 users online.

This is a bit confusing: how do you check the number of online users in the database? There are users there, but there is no valid and reliable way to check the number of online users there.

Ignore this. It's a red herring and not related/applicable.

It would make sense that some of these connections are "old", but it doesn't make sense that the system reports 30 connections currently open while JMX reports 800 open connections.

If your system is running under a particular load there is a slim possibility of this happening, but then again, it would be easier to check if you could provide more details about the usage.

I have 2 different systems currently under different types of load but both exhibit the same symptoms.

Environment A has users hitting the cluster on the websocket connection and JMX is reporting more websocket connections than we have users connected.

Environment B has no users connecting through websockets and has 100k+ users connected via XMPP but JMX is reporting between 350 and 500 connections per server on ws2s/Open connections.

Health checks in both systems are on 5-second intervals with a timeout of 16 seconds.

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Sam Wright wrote:

Wojciech Kapcia wrote:

Sam Wright wrote:

I have also seen similar issues using ws2s/Open connections.

What is your usage? Is it under regular load or under a stress/load test?

This is during both zero load and stress-test load (the stress test would hit 5222 for XMPP, not 5290 for WebSockets). I've also determined that a health check is causing the extra connections despite no users connecting to the ports. But Tigase should not be reporting these extra connections.

Could you clarify how exactly you are distinguishing between health-check connections and user connections? The ws2s component is a connection manager; it reports all open connections to that particular component, not only fully authenticated user sessions (those are covered by separate statistics provided by the session manager).

When I do netstat -lpna | grep 5290 | wc -l, which counts the number of connections to the websocket port on one server, I get ~30 open websocket connections. When I look at JMX ws2s/Open connections I see around ~800/server.

Could you observe how the number of connections behaves over time in relation to connected/disconnected clients? Does the number of connections decrease whenever a websocket connection is gone?

The number of connections does drop off (garbage collection) but with a cluster of

"cluster of…"?

It would make sense that some of these connections are "old", but it doesn't make sense that the system reports 30 connections currently open while JMX reports 800 open connections.

If your system is running under a particular load there is a slim possibility of this happening, but then again, it would be easier to check if you could provide more details about the usage.

I have 2 different systems currently under different types of load but both exhibit the same symptoms.

Environment A has users hitting the cluster on the websocket connection and JMX is reporting more websocket connections than we have users connected.

Environment B has no users connecting through websockets and has 100k+ users connected via XMPP but JMX is reporting between 350 and 500 connections per server on ws2s/Open connections.

Health checks in both systems are on 5-second intervals with a timeout of 16 seconds.

It seems everything boils down to how exactly you are performing your 'health check'. Could you clarify this?

Added by Sam Wright almost 5 years ago

Wojciech Kapcia wrote:

Sam Wright wrote:

Wojciech Kapcia wrote:

Sam Wright wrote:

I have also seen similar issues using ws2s/Open connections.

What is your usage? Is it under regular load or under a stress/load test?

This is during both zero load and stress-test load (the stress test would hit 5222 for XMPP, not 5290 for WebSockets). I've also determined that a health check is causing the extra connections despite no users connecting to the ports. But Tigase should not be reporting these extra connections.

Could you clarify how exactly you are distinguishing between health-check connections and user connections? The ws2s component is a connection manager; it reports all open connections to that particular component, not only fully authenticated user sessions (those are covered by separate statistics provided by the session manager).

In Environment B (which is an internal testing environment) we have no users connected via ws2s. Users are only connected via the c2s port, using a Tsung cluster to manage the users. Turning off the health check drops the number of ws2s connections to 0. When the health check is turned back on, the connections climb back to the same numbers. This is definitely caused by the health check.

When I do netstat -lpna | grep 5290 | wc -l, which counts the number of connections to the websocket port on one server, I get ~30 open websocket connections. When I look at JMX ws2s/Open connections I see around ~800/server.

Could you observe how the number of connections behaves over time in relation to connected/disconnected clients? Does the number of connections decrease whenever a websocket connection is gone?

The number of connections does drop off (garbage collection) but with a cluster of

"cluster of…"?

Whoa. Sorry about that. I must have lost this reply in the huge list of ongoing replies we have. The number of connections does drop off (this is sudden, so I'm assuming it's garbage collection), but the number of connections is always climbing due to the health check.

It would make sense that some of these connections are "old", but it doesn't make sense that the system reports 30 connections currently open while JMX reports 800 open connections.

If your system is running under a particular load there is a slim possibility of this happening, but then again, it would be easier to check if you could provide more details about the usage.

I have 2 different systems currently under different types of load but both exhibit the same symptoms.

Environment A has users hitting the cluster on the websocket connection and JMX is reporting more websocket connections than we have users connected.

Environment B has no users connecting through websockets and has 100k+ users connected via XMPP but JMX is reporting between 350 and 500 connections per server on ws2s/Open connections.

Health checks in both systems are on 5-second intervals with a timeout of 16 seconds.

It seems everything boils down to how exactly you are performing your 'health check'. Could you clarify this?

I agree. I am not sure of all the details of the health check. I know it opens a TCP connection on a port, and as long as the connection is successful it should close it. However, I do not believe the health check is the root cause. My concern is that the netstat utility shows fewer connections on port 5290 (websockets) than JMX (ws2s/Open connections). Shouldn't these numbers be the same regardless of the source of the connections?

Added by HaXkil The almost 5 years ago

Hi. We need to provide a way for load balancers to check whether Tigase is alive and running.

Previously, the load balancers just sent random data to the XMPP and BOSH ports. This caused the Tigase server to have ~1000 connections created by those health checks.

To avoid all those unwanted connections, the load balancers started sending the following data

<stream:stream to='di' xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' version='1.0'></stream:stream>

over TCP.
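
For illustration, a minimal sketch of such a probe in Java; the host name is a placeholder and the timeouts are arbitrary:

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class XmppPortProbe {
    public static void main(String[] args) throws Exception {
        // The same stream header + close that the load balancer sends.
        String probe = "<stream:stream to='di' xmlns='jabber:client'"
                + " xmlns:stream='http://etherx.jabber.org/streams'"
                + " version='1.0'></stream:stream>";
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("xmpp.example.private", 5222), 2000);
            socket.setSoTimeout(2000);
            OutputStream out = socket.getOutputStream();
            out.write(probe.getBytes(StandardCharsets.UTF_8));
            out.flush();
            // Any bytes coming back (the server's stream header) mean Tigase is alive.
            byte[] buf = new byte[512];
            int n;
            try {
                n = socket.getInputStream().read(buf);
            } catch (SocketTimeoutException e) {
                n = -1; // no reply within the timeout
            }
            System.out.println(n > 0 ? "alive" : "no response");
        }
    }
}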

This payload makes Tigase open and immediately close the connection. That works well for XMPP, but in the case of BOSH it causes Tigase to throw an exception:

2014-05-14 15:19:18,705 ERROR              t.d.AbstractMessageReceiver [in_9-bosh] - [in_9-bosh] Exception during packet processing: from=sess-man@xxx.general.private, to=bosh@xxx.general.private/10.0.0.0_5280_10.0.0.1_44258, DATA=<iq to="bosh@xxx.general.private/10.0.0.0_5280_10.0.0.1_44258" id="9889e04a-d84e-4ac6-a890-1adf9248d188" type="result" from="sess-man@xxx.general.private"><command node="GETFEATURES" xmlns="http://jabber.org/protocol/commands"><ver xmlns="urn:xmpp:features:rosterver"/></command></iq>, SIZE=334, XMLNS=null, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=result
java.lang.IllegalArgumentException: Invalid UUID string: 10.0.0.0_5280_10.0.0.1_44258
    at java.util.UUID.fromString(Unknown Source) ~[na:1.7.0_06]
    at tigase.server.bosh.BoshConnectionManager.getBoshSession(BoshConnectionManager.java:811) ~[tigase-server-5.2.0.jar:5.2.0-b3447/48635d0a (2014-02-12/17:29:15)]
    at tigase.server.bosh.BoshConnectionManager.processCommand(BoshConnectionManager.java:640) ~[tigase-server-5.2.0.jar:5.2.0-b3447/48635d0a (2014-02-12/17:29:15)]
    at tigase.server.xmppclient.ClientConnectionManager.processPacket(ClientConnectionManager.java:156) ~[tigase-server-5.2.0.jar:5.2.0-b3447/48635d0a (2014-02-12/17:29:15)]
    at tigase.server.bosh.BoshConnectionManager.processPacket(BoshConnectionManager.java:181) ~[tigase-server-5.2.0.jar:5.2.0-b3447/48635d0a (2014-02-12/17:29:15)]
    at tigase.server.AbstractMessageReceiver$QueueListener.run(AbstractMessageReceiver.java:1475) ~[tigase-server-5.2.0.jar:5.2.0-b3447/48635d0a (2014-02-12/17:29:15)]

Do you have any idea what kind of data we should send to the BOSH port in order to check that Tigase is alive without causing Tigase to open a new connection for that request?


Added by Artur Hefczyc TigaseTeam almost 5 years ago

For Bosh you can just send this:

<body/>
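
For illustration, a minimal sketch of posting this to the BOSH endpoint over plain HTTP; the host is a placeholder, and the default BOSH port 5280 with the root path is an assumption to verify against your configuration:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BoshProbe {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://xmpp.example.private:5280/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setConnectTimeout(2000);
        conn.setReadTimeout(2000);
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write("<body/>".getBytes(StandardCharsets.UTF_8));
        }
        // Any HTTP response code means the BOSH component answered;
        // the body is typically a <body/> element with a terminate condition.
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}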

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Sam Wright wrote:

Could you clarify how exactly you are distinguishing between health-check connections and user connections? The ws2s component is a connection manager; it reports all open connections to that particular component, not only fully authenticated user sessions (those are covered by separate statistics provided by the session manager).

In Environment B (which is an internal testing environment) we have no users connected via ws2s. Users are only connected via the c2s port, using a Tsung cluster to manage the users. Turning off the health check drops the number of ws2s connections to 0. When the health check is turned back on, the connections climb back to the same numbers. This is definitely caused by the health check.

OK, for clarification: so far the general ConnectionsNumber over JMX hasn't included WebSocket connections, only BOSH and socket connections.

"cluster of…"?

Whoa. Sorry about that. I must have lost this reply in the huge list of ongoing replies we have. The number of connections does drop off (this is sudden, so I'm assuming it's garbage collection), but the number of connections is always climbing due to the health check.

I have 2 different systems currently under different types of load but both exhibit the same symptoms.

Environment A has users hitting the cluster on the websocket connection and JMX is reporting more websocket connections than we have users connected.

Environment B has no users connecting through websockets and has 100k+ users connected via XMPP but JMX is reporting between 350 and 500 connections per server on ws2s/Open connections.

Health checks in both systems are on 5-second intervals with a timeout of 16 seconds.

It seems everything boils down to how exactly you are performing your 'health check'. Could you clarify this?

I agree. I am not sure of all the details of the health check. I know it opens a TCP connection on a port, and as long as the connection is successful it should close it. However, I do not believe the health check is the root cause. My concern is that the netstat utility shows fewer connections on port 5290 (websockets) than JMX (ws2s/Open connections). Shouldn't these numbers be the same regardless of the source of the connections?

They should be, but… the number of /Open connections reported by Tigase maps directly to the number of active IO services, so if there were a problem with disconnections, the number of open connections may (for some time) be reported as bigger than it really is. Hence my concerns:

  • are there any exceptions in the logs when you perform the health check?

  • can you share more details regarding how exactly you perform the health check?

Added by Sam Wright almost 5 years ago

Wojciech Kapcia wrote:

Sam Wright wrote:

Could you clarify how exactly you are distinguishing between health-check connections and user connections? The ws2s component is a connection manager; it reports all open connections to that particular component, not only fully authenticated user sessions (those are covered by separate statistics provided by the session manager).

In Environment B (which is an internal testing environment) we have no users connected via ws2s. Users are only connected via the c2s port, using a Tsung cluster to manage the users. Turning off the health check drops the number of ws2s connections to 0. When the health check is turned back on, the connections climb back to the same numbers. This is definitely caused by the health check.

OK, for clarification: so far the general ConnectionsNumber over JMX hasn't included WebSocket connections, only BOSH and socket connections.

I'm not using the general ConnectionsNumber over JMX; I'm using getAllStats(), which returns ws2s/Open connections, c2s/Open connections, and bosh/Open connections.

"cluster of…"?

Whoa. Sorry about that. I must have lost this reply in the huge list of ongoing replies we have. The number of connections does drop off (this is sudden, so I'm assuming it's garbage collection), but the number of connections is always climbing due to the health check.

I have 2 different systems currently under different types of load but both exhibit the same symptoms.

Environment A has users hitting the cluster on the websocket connection and JMX is reporting more websocket connections than we have users connected.

Environment B has no users connecting through websockets and has 100k+ users connected via XMPP but JMX is reporting between 350 and 500 connections per server on ws2s/Open connections.

Health checks in both systems are on 5-second intervals with a timeout of 16 seconds.

It seems everything boils down to how exactly you are performing your 'health check'. Could you clarify this?

I agree. I am not sure of all the details of the health check. I know it opens a TCP connection on a port, and as long as the connection is successful it should close it. However, I do not believe the health check is the root cause. My concern is that the netstat utility shows fewer connections on port 5290 (websockets) than JMX (ws2s/Open connections). Shouldn't these numbers be the same regardless of the source of the connections?

They should be, but… the number of /Open connections reported by Tigase maps directly to the number of active IO services, so if there were a problem with disconnections, the number of open connections may (for some time) be reported as bigger than it really is. Hence my concerns:

  • are there any exceptions in the logs when you perform the health check?

  • can you share more details regarding how exactly you perform the health check?

No exceptions.

However, I looked into the different types of health checks offered by the load balancer. We were using a TCP health check with no payload. I've since solved the problem by using different types of health checks. For websockets I used a partial TCP health check: it waits for the monitor to receive the SYN-ACK packet and doesn't finish the TCP handshake. For BOSH I used the <body/> tag as Artur suggested (did not confirm it worked), and for socket/XMPP I used the XMPP stanza from HaXkil above.

At this point the issue has been resolved for me. The issue was the health checks, although I do question whether Tigase is handling the connections properly or whether my understanding of the health checks I'm using is flawed.

For reference, here's the health check documentation for our F5 load balancer: http://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/ltm_configuration_guide_10_0_0/ltm_monitors.html#1201151

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Sam Wright wrote:

It seems everything boils down to how exactly you are performing your 'health check'. Could you clarify this?

I agree. I am not sure of all the details of the health check. I know it opens a TCP connection on a port, and as long as the connection is successful it should close it. However, I do not believe the health check is the root cause. My concern is that the netstat utility shows fewer connections on port 5290 (websockets) than JMX (ws2s/Open connections). Shouldn't these numbers be the same regardless of the source of the connections?

They should be, but… the number of /Open connections reported by Tigase maps directly to the number of active IO services, so if there were a problem with disconnections, the number of open connections may (for some time) be reported as bigger than it really is. Hence my concerns:

  • are there any exceptions in the logs when you perform the health check?

  • can you share more details regarding how exactly you perform the health check?

No exceptions.

However, I looked into the different types of health checks offered by the load balancer. We were using a TCP health check with no payload. I've since solved the problem by using different types of health checks. For websockets I used a partial TCP health check: it waits for the monitor to receive the SYN-ACK packet and doesn't finish the TCP handshake. For BOSH I used the <body/> tag as Artur suggested (did not confirm it worked), and for socket/XMPP I used the XMPP stanza from HaXkil above.

At this point the issue has been resolved for me. The issue was the health checks, although I do question whether Tigase is handling the connections properly or whether my understanding of the health checks I'm using is flawed.

For reference, here's the health check documentation for our F5 load balancer: http://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/ltm_configuration_guide_10_0_0/ltm_monitors.html#1201151

We will be looking into the number of connections reported under such circumstances.

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Sam Wright wrote:

The issue was the health checks, although I do question whether Tigase is handling the connections properly or whether my understanding of the health checks I'm using is flawed.

Please try the next nightly, as it includes a fix for closing websocket connections for which negotiation failed, which in turn should mitigate the issue with the statistics.
