
CPU occasionally spiking to 100%

Carl Hasselskog
Added almost 5 years ago

Hi,

Our Tigase server (which you helped us set up) usually has very low CPU load, but occasionally it spikes to 100% and then returns to normal (see the attached screenshot of the CPU load).

The last time that occurred I managed to capture some thread dumps using jstack. I took several dumps in a row to get some "poor man's CPU sampling".
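For reference, the "sampling" was nothing more sophisticated than something like this (a rough sketch; we simply ran jstack from the shell a few times, so the helper below is only an illustration and the sample count/interval are arbitrary):

    import java.io.File;
    import java.io.IOException;

    // Rough sketch of the "poor man's CPU sampling": run jstack against the
    // Tigase JVM several times, a few seconds apart, and compare the RUNNABLE
    // stacks by hand afterwards. Pass the Tigase PID as the first argument;
    // assumes the jstack binary from the same JDK is on the PATH.
    public class JstackSampler {
        public static void main(String[] args) throws IOException, InterruptedException {
            String pid = args[0];
            for (int i = 0; i < 10; i++) {
                new ProcessBuilder("jstack", "-l", pid)
                        .redirectErrorStream(true)
                        .redirectOutput(new File("jstack-" + i + ".txt"))
                        .start()
                        .waitFor();
                Thread.sleep(3000); // a few seconds between samples
            }
        }
    }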

The thing that caught my attention was that several threads had the following stack:

   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x0000000689f5c138> (a sun.nio.ch.Util$2)
    - locked <0x0000000689f5c150> (a java.util.Collections$UnmodifiableSet)
    - locked <0x0000000689f54868> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
    at tigase.net.SocketThread.run(SocketThread.java:251)
    at java.lang.Thread.run(Thread.java:744)

After some googling I found this thread, which looks to be the same issue: https://github.com/netty/netty/issues/327

Is there any known workaround to this issue for Tigase?

I should add that we are running the following version of Java:

Java(TM) SE Runtime Environment (build 1.7.0_51-b13)

Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

I know that you recommended against running Java 7, but we have some libraries that require Java 7, so we are forced to use Java 7 or 8 (I therefore take full responsibility for this issue if it turns out to be Java 7-specific).


Replies (11)

Added by Carl Hasselskog almost 5 years ago

I should also add that we are running Tigase 5.1.5.


Added by Artur Hefczyc TigaseTeam almost 5 years ago

Yes, this is a bug in the JDK, or a miscommunication between the OS and the JDK.

The problem is that a user's connection is broken, but the OS sets a flag on this connection which Java interprets as the connection being ready to read from. Tigase attempts to read data but nothing is there, so it puts the connection back into the pool. Of course, Java instantly reports the same connection as ready to read from again, Tigase attempts to read again... This creates a loop of continuously reading from the TCP/IP connection, hence the high CPU usage. There is logic in Tigase which detects such broken connections and eventually removes them from the connection pool and closes them. However, it may take some time, especially if there are more connections like this.
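For reference, the generic workaround other NIO frameworks apply for this (Netty does it in the issue you linked) is to detect when select() starts returning immediately with nothing to do and then rebuild the selector. A rough sketch of that technique, not Tigase's actual code; the class name, threshold and timeout below are made up for illustration:

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;

    // Sketch of the selector-rebuild workaround for the epoll spin bug: if
    // select() keeps returning 0 well before its timeout, assume the selector
    // is broken, open a new one and re-register all channels on it.
    public class SelectorRebuildSketch {

        private static final int SPIN_THRESHOLD = 512; // assumed value, tune as needed

        private Selector selector;

        SelectorRebuildSketch() throws IOException {
            this.selector = Selector.open();
        }

        void selectLoop() throws IOException {
            int spuriousWakeups = 0;
            while (true) {
                long before = System.nanoTime();
                int ready = selector.select(500);
                long elapsedMs = (System.nanoTime() - before) / 1_000_000;

                if (ready == 0 && elapsedMs < 100) {
                    // Returned immediately with nothing ready: possible epoll bug.
                    if (++spuriousWakeups >= SPIN_THRESHOLD) {
                        rebuildSelector();
                        spuriousWakeups = 0;
                    }
                    continue;
                }
                spuriousWakeups = 0;
                // ... normal processing of selector.selectedKeys() would go here ...
            }
        }

        private void rebuildSelector() throws IOException {
            Selector oldSelector = selector;
            Selector newSelector = Selector.open();
            // Re-register every still-valid channel with the same interest ops and attachment.
            for (SelectionKey key : oldSelector.keys()) {
                if (key.isValid()) {
                    key.channel().register(newSelector, key.interestOps(), key.attachment());
                }
            }
            selector = newSelector;
            oldSelector.close(); // cancels all keys left on the old selector
        }
    }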

Apparently JDK7 from Oracle (Sun) has a fix for the bug, but OpenJDK does not. If this still happens for you under JDK7, I assume you have OpenJDK installed. Please try replacing it with JDK7 from Oracle. We have run extensive tests with JDK7 (from Oracle) and we now recommend using it. We do not have much experience with Tigase on JDK8 yet.

You could also try upgrading to Tigase 5.2.1 as we have made lots of improvements in the networking area as well as in other places.

Added by Carl Hasselskog almost 5 years ago

Thanks for your quick response! Unfortunately we are already running Oracle JDK7 (Java 7 Update 51, to be precise). Perhaps upgrading to Java 7 Update 55 will solve it? I'll upgrade Tigase as well. Do I just replace the .jar-files in the "libs" and "jars" directories and restart the service? Any db migration or similar that I have to do (we are running 5.1.5 at the moment)?

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Carl Hasselskog wrote:

I'll upgrade Tigase as well. Do I just replace the .jar-files in the "libs" and "jars" directories and restart the service? Any db migration or similar that I have to do (we are running 5.1.5 at the moment)?

The best way is to extract the new version next to the current one, copy over the configuration from etc/* and certs/*, and then point the symlink to the new version. That way, if anything goes wrong, you have a fallback version.

This also helps with the fact that versions 5.2.x and newer merged the libs/ and jars/ directories (which will most likely also require updating the service script from the scripts/ directory).

There is no db migration needed at this point.

Added by Carl Hasselskog almost 5 years ago

Ok, thanks. I will do that.

Added by Carl Hasselskog almost 5 years ago

Upgrading Java to Oracle Java 7 Update 55 didn't help. We still see the same issue.

Unfortunately we ran into some cert issues when upgrading to Tigase 5.2.1. Previously our server cert didn't include the entire cert chain (i.e. only our private key and our cert, not the CA's root cert or the CA's intermediate cert). After including the entire chain in the pem file we ran into problems with our client (at the moment it doesn't allow a cert chain larger than 2). We therefore have to update all our clients before we can upgrade Tigase. That should take a couple of weeks. I will let you know once we are finished.
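In case it's useful, this is roughly how I check what chain the server actually presents after the pem change (just a sketch; the host and port are placeholders, 5223 being the legacy TLS port, and for STARTTLS on 5222 you would have to drive the XMPP stream setup first):

    import java.security.cert.Certificate;
    import java.security.cert.X509Certificate;
    import javax.net.ssl.SSLSocket;
    import javax.net.ssl.SSLSocketFactory;

    // Print the length and subjects of the certificate chain a TLS server sends.
    // Host and port below are placeholders; this only works against a port that
    // speaks TLS directly (not STARTTLS).
    public class ChainLength {
        public static void main(String[] args) throws Exception {
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
            try (SSLSocket socket = (SSLSocket) factory.createSocket("production-xmpp.degoo.com", 5223)) {
                socket.startHandshake();
                Certificate[] chain = socket.getSession().getPeerCertificates();
                System.out.println("Chain length: " + chain.length);
                for (Certificate c : chain) {
                    System.out.println("  " + ((X509Certificate) c).getSubjectDN());
                }
            }
        }
    }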

Added by Carl Hasselskog over 4 years ago

Hi,

We've now upgraded all our clients, so we can upgrade to Tigase 5.2.1. Unfortunately we ran into some problems when upgrading. Here's what we did:

As you remember we have two servers in a cluster: tigase-1 and tigase-2. In order to minimize downtime and risks we did the following:

  1. Changed the DNS entry to point only to tigase-1 (so that no client would be connected to tigase-2).

  2. Upgraded tigase-2 to Tigase 5.2.1 as per your instructions.

  3. Started tigase-2 and verified that we could connect by manually changing the hosts file on my computer to point to tigase-2 instead of tigase-1.

  4. Changed the DNS entry to point only to tigase-2.

By doing so we saw our clients gradually start connecting to tigase-2 instead of tigase-1. Initially everything seemed to be working smoothly. However, after some time we started seeing very high load on tigase-2, and eventually it started dropping packets. I guessed it could have been caused by the fact that the initial load is a bit higher when lots of clients connect simultaneously. I therefore tried upgrading tigase-2's EC2 instance type from m1.large to m3.xlarge (and increased the heap size accordingly) and then let it run for a while to get past that initial spike. Unfortunately that didn't help, so I started looking into the logs. I found some log entries that looked suspicious and were logged on tigase-2 but that I've never seen on tigase-1:

  1. "ClusterConnectionManager.processHandshake() WARNING: Handshaking password doesn't match, disconnecting: null, type: accept, Socket: nullSocket[addr=/10.82.222.173,port=18055,localport=5277], jid: null".

I'm assuming this is because tigase-1 and tigase-2 are running different versions, and that this will go away once tigase-1 is upgraded as well?

  2. 2014-06-30 14:07:32.346 [jabber:iq:auth Queue Worker 0] JabberIqAuth.doAuth() WARNING: Can't authenticate with given CallbackHandler

java.io.IOException: Password verification problem.
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handleVerifyPasswordCallback(AuthRepoPlainCallbackHandler.java:117)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handleCallback(AuthRepoPlainCallbackHandler.java:144)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handle(AuthRepoPlainCallbackHandler.java:75)
    at tigase.xmpp.impl.JabberIqAuth.doAuth(JabberIqAuth.java:350)
    at tigase.xmpp.impl.JabberIqAuth.process(JabberIqAuth.java:268)
    at tigase.server.xmppsession.SessionManager$ProcessorWorkerThread.process(SessionManager.java:2689)
    at tigase.util.WorkerThread.run(WorkerThread.java:132)
Caused by: tigase.db.UserNotFoundException: User does not exist: 28533@production-xmpp.degoo.com
    at tigase.db.jdbc.TigaseAuth.plainAuth(TigaseAuth.java:391)
    at tigase.db.jdbc.TigaseAuth.otherAuth(TigaseAuth.java:319)
    at tigase.db.AuthRepositoryMDImpl.otherAuth(AuthRepositoryMDImpl.java:249)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handleVerifyPasswordCallback(AuthRepoPlainCallbackHandler.java:110)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handleCallback(AuthRepoPlainCallbackHandler.java:144)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handle(AuthRepoPlainCallbackHandler.java:75)
    at tigase.xmpp.impl.JabberIqAuth.doAuth(JabberIqAuth.java:350)
    at tigase.xmpp.impl.JabberIqAuth.process(JabberIqAuth.java:268)
    at tigase.server.xmppsession.SessionManager$ProcessorWorkerThread.process(SessionManager.java:2689)
    at tigase.util.WorkerThread.run(WorkerThread.java:132)

This one is very odd because it doesn't occur on tigase-1, and tigase-2 is using the same db. It's also very unlikely that the client is providing an incorrect username/password, because the password is automatically generated and handled by the client; the user never enters a password. I've also checked in the db, and the users this fails for are present in tigasedb.tig_users.

What do you think is the problem here?

Regards

Carl


Added by Artur Hefczyc TigaseTeam over 4 years ago

Carl Hasselskog wrote:

Hi,

We've now upgraded all our clients, so we can upgrade to Tigase 5.2.1. Unfortunately we ran into some problems when upgrading. Here's what we did:

As you remember we have two servers in a cluster: tigase-1 and tigase-2. In order to minimize downtime and risks we did the following:

  1. Changed the DNS entry to point only to tigase-1 (so that no client would be connected to tigase-2).

  2. Upgraded tigase-2 to Tigase 5.2.1 as per your instructions.

  3. Started tigase-2 and verified that we could connect by manually changing the hosts file on my computer to point to tigase-2 instead of tigase-1.

  4. Changed the DNS entry to point only to tigase-2.

By doing so we saw our clients gradually start connecting to tigase-2 instead of tigase-1. Initially everything seemed to be working smoothly. However, after some time we started seeing very high load on tigase-2, and eventually it started dropping packets. I guessed it could have been caused by the fact that the initial load is a bit higher when lots of clients connect simultaneously. I therefore tried upgrading tigase-2's EC2 instance type from m1.large to m3.xlarge (and increased the heap size accordingly) and then let it run for a while to get past that initial spike. Unfortunately that didn't help, so I started looking into the logs.

Yes, this is odd. There were some changes in the Tigase clustering code; I am not sure if they can cause this kind of problem. I have never tried connecting two different versions together in a single cluster. This is definitely not advised and not supported, so I would work on this first: make sure the same version runs on all cluster nodes.

I found some log entries that looked suspicious and were logged on tigase-2 but that I've never seen on tigase-1:

  1. "ClusterConnectionManager.processHandshake() WARNING: Handshaking password doesn't match, disconnecting: null, type: accept, Socket: nullSocket[addr=/10.82.222.173,port=18055,localport=5277], jid: null".

I'm assuming this is because tigase-1 and tigase-2 are running different versions, and that this will go away once tigase-1 is upgraded as well?

  2. 2014-06-30 14:07:32.346 [jabber:iq:auth Queue Worker 0] JabberIqAuth.doAuth() WARNING: Can't authenticate with given CallbackHandler

java.io.IOException: Password verification problem.
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handleVerifyPasswordCallback(AuthRepoPlainCallbackHandler.java:117)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handleCallback(AuthRepoPlainCallbackHandler.java:144)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handle(AuthRepoPlainCallbackHandler.java:75)
    at tigase.xmpp.impl.JabberIqAuth.doAuth(JabberIqAuth.java:350)
    at tigase.xmpp.impl.JabberIqAuth.process(JabberIqAuth.java:268)
    at tigase.server.xmppsession.SessionManager$ProcessorWorkerThread.process(SessionManager.java:2689)
    at tigase.util.WorkerThread.run(WorkerThread.java:132)
Caused by: tigase.db.UserNotFoundException: User does not exist: 28533@production-xmpp.degoo.com
    at tigase.db.jdbc.TigaseAuth.plainAuth(TigaseAuth.java:391)
    at tigase.db.jdbc.TigaseAuth.otherAuth(TigaseAuth.java:319)
    at tigase.db.AuthRepositoryMDImpl.otherAuth(AuthRepositoryMDImpl.java:249)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handleVerifyPasswordCallback(AuthRepoPlainCallbackHandler.java:110)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handleCallback(AuthRepoPlainCallbackHandler.java:144)
    at tigase.auth.impl.AuthRepoPlainCallbackHandler.handle(AuthRepoPlainCallbackHandler.java:75)
    at tigase.xmpp.impl.JabberIqAuth.doAuth(JabberIqAuth.java:350)
    at tigase.xmpp.impl.JabberIqAuth.process(JabberIqAuth.java:268)
    at tigase.server.xmppsession.SessionManager$ProcessorWorkerThread.process(SessionManager.java:2689)
    at tigase.util.WorkerThread.run(WorkerThread.java:132)

This one is very odd because it doesn't occur on tigase-1, and tigase-2 is using the same db. It's also very unlikely that the client is providing an incorrect username/password, because the password is automatically generated and handled by the client; the user never enters a password. I've also checked in the db, and the users this fails for are present in tigasedb.tig_users.

I wonder if this problem might be related to a different password encoding. Also, please double-check that the new cluster node connects to the same database. Otherwise, this needs some deeper investigation. You can turn on DB debug logging to see what queries are executed and then try those queries yourself to see what you get. There were some changes to the password-checking logic, but they should all be backward compatible.

Can you connect a new Tigase version as standalone (not clustered) from a development machine to see if authentication succeeds?
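If it helps, here is a minimal sketch of such a manual check against the database, assuming the default Tigase schema (the tigasedb.tig_users table with a user_id column, as in your log); the JDBC URL and credentials are placeholders you would need to adjust:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Quick manual check that the failing account is visible through the same DB
    // connection the node uses. Table/column names follow the default Tigase
    // schema (tig_users.user_id); URL and credentials below are placeholders.
    public class CheckUser {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://db-host/tigasedb";    // placeholder
            String jid = "28533@production-xmpp.degoo.com";  // the user from your log
            try (Connection conn = DriverManager.getConnection(url, "tigase_user", "tigase_pass");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT uid, user_id FROM tig_users WHERE user_id = ?")) {
                ps.setString(1, jid);
                try (ResultSet rs = ps.executeQuery()) {
                    System.out.println(rs.next() ? "user found" : "user NOT found");
                }
            }
        }
    }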

Added by Carl Hasselskog over 4 years ago

I've now tried again, but this time in standalone mode. More specifically, I removed the node from the --cluster-nodes setting in init.properties. Is that enough or should I do something more?

That change got rid of the error messages related to clustering. The password verification error is still there, but I suspect it's due to a client with corrupt settings (the generated password is stored among the settings), because when I look in tigasedb it looks like the client has failed plenty of times before the server upgrade. I will implement client-side repair of corrupt settings and see if that solves that particular issue. I don't think it's an encoding issue, because all generated passwords are just UUIDs, so there are few possible encoding ambiguities.
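For what it's worth, the repair I have in mind is roughly this (just a sketch; the Preferences-based storage key and the reRegisterAccount() call are placeholders for whatever our real client uses):

    import java.util.UUID;
    import java.util.prefs.Preferences;

    // Sketch of the planned client-side repair: if the locally stored password is
    // missing or not a well-formed UUID, generate a fresh one and re-register the
    // account with it. The settings key and reRegisterAccount() are placeholders.
    public class SettingsRepair {

        static String repairPassword(Preferences settings) {
            String stored = settings.get("xmppPassword", null);
            if (stored == null || !looksLikeUuid(stored)) {
                String fresh = UUID.randomUUID().toString();
                settings.put("xmppPassword", fresh);
                reRegisterAccount(fresh); // placeholder for the client's registration flow
                return fresh;
            }
            return stored;
        }

        static boolean looksLikeUuid(String s) {
            try {
                UUID.fromString(s);
                return true;
            } catch (IllegalArgumentException e) {
                return false;
            }
        }

        static void reRegisterAccount(String password) {
            // placeholder: perform XMPP in-band (re-)registration with the new password
        }
    }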

Unfortunately the server still gets overwhelmed by all the traffic (queues overflow). When I look in the logs this time, I find something else that looks suspicious: lots of log messages (many per second) that look like this:

2014-07-08 19:41:52.552 [default-handler Queue Worker 3]  SessionManager$ProcessorWorkerThread.process()  INFO: Already error packet, ignoring: from=sess-man@ec2-54-243-30-200.compute-1.amazonaws.com, to=sess-man@ec2-54-243-30-200.compute-1.amazonaws.com, DATA=<presence from="33287@production-xmpp.degoo.com" type="error" to="25475@production-xmpp.degoo.com"><error code="501" type="cancel"><feature-not-implemented xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/><text xml:lang="en" xmlns="urn:ietf:params:xml:ns:xmpp-stanzas">CData size: 33</text></error></presence>, SIZE=324, XMLNS=null, PRIORITY=PRESENCE, PERMISSION=AUTH, TYPE=error

Could this be the reason for the high load? What do you think is causing this?

I can send you the entire log if you want to but I'd rather send that privately. What's your preferred channel for that?

Regards

Carl


Added by Artur Hefczyc TigaseTeam over 4 years ago

Carl Hasselskog wrote:

I've now tried again, but this time in standalone mode. More specifically, I removed the node from the --cluster-nodes setting in init.properties. Is that enough or should I do something more?

We introduced a cluster auto-configuration mechanism some time ago, so the nodes may still try to connect; this depends on the exact version you have. The best way to switch clustering off is to comment out --cluster-mode=true or change it to false.

That change got rid of the error messages related to clustering. The password verification error is still there, but I suspect it's due to a client with corrupt settings (the generated password is stored among the settings), because when I look in tigasedb it looks like the client has failed plenty of times before the server upgrade. I will implement client-side repair of corrupt settings and see if that solves that particular issue. I don't think it's an encoding issue, because all generated passwords are just UUIDs, so there are few possible encoding ambiguities.

OK, I am assuming then that it is not a problem in Tigase. For now, if you find more evidence that there is something wrong with Tigase, we can investigate.

Unfortunately the server still gets overwhelmed by all the traffic (queues overflow).

Could you share more details on how many users you have and what servers this is running on (I guess this is not m3.xlarge)? Do you also know which queues grow? This would help us pinpoint the bottleneck.

When I look in the logs this time, I find something else that looks suspicious: lots of log messages (many per second) that look like this:

2014-07-08 19:41:52.552 [default-handler Queue Worker 3]  SessionManager$ProcessorWorkerThread.process()  INFO: Already error packet, ignoring: from=sess-man@ec2-54-243-30-200.compute-1.amazonaws.com, to=sess-man@ec2-54-243-30-200.compute-1.amazonaws.com, DATA=<presence from="33287@production-xmpp.degoo.com" type="error" to="25475@production-xmpp.degoo.com"><error code="501" type="cancel"><feature-not-implemented xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/><text xml:lang="en" xmlns="urn:ietf:params:xml:ns:xmpp-stanzas">CData size: 33</text></error></presence>, SIZE=324, XMLNS=null, PRIORITY=PRESENCE, PERMISSION=AUTH, TYPE=error

Could this be the reason for the high load? What do you think is causing this?

Yes, it could. It looks like presence packets are being bounced back and forth with an error. To me it looks like some misconfiguration on the server or on the network.

Does it happen only on the new version of the server, or does it happen on the old version as well? I am trying to think of any changes we made in the meantime which could affect packet processing.

I can send you the entire log if you want to but I'd rather send that privately. What's your preferred channel for that?

Could you please make a compressed log available somewhere for download and send me a link to my email address? Ideally, I would like you to stop the server, remove all logs, start the server, and then send me the logs before they get rotated but after the problem starts to appear. Also, to have a really complete picture, it would be helpful if you could turn debug mode ON (this will slow your server down, so once you have collected the logs you need to switch it off and restart the server again). Also, if you could collect full Tigase statistics via JMX and send me the txt file, that would be useful as well.
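If you do not already have a JMX client handy, something along these lines would dump all MBean attributes from the Tigase JVM into a text file. It assumes JMX remote access is enabled; the host and port are placeholders for your monitoring configuration, and the exact MBean names vary between Tigase versions:

    import java.io.PrintWriter;
    import javax.management.MBeanAttributeInfo;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Rough sketch: connect over JMX and dump every attribute of every MBean in
    // the "tigase*" domains to tigase-stats.txt. Host, port and the domain
    // pattern are assumptions to adjust to your setup.
    public class DumpStats {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://tigase-2:9050/jmxrmi"); // placeholder host/port
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url);
                 PrintWriter out = new PrintWriter("tigase-stats.txt")) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                for (ObjectName name : mbs.queryNames(new ObjectName("tigase*:*"), null)) {
                    out.println("== " + name);
                    for (MBeanAttributeInfo attr : mbs.getMBeanInfo(name).getAttributes()) {
                        try {
                            out.println(attr.getName() + " = " + mbs.getAttribute(name, attr.getName()));
                        } catch (Exception e) {
                            out.println(attr.getName() + " = <unreadable: " + e + ">");
                        }
                    }
                }
            }
        }
    }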

Added by Carl Hasselskog over 4 years ago

I've sent you an email now with all the logs.

Regards

Carl
