Problem with multiple Bosh connections

Sean Drummond
Added about 5 years ago

Hi guys, we are getting some pretty concerning log messages when doing some simple load testing with Bosh from one server. After we get up to about 25 users (each with ~50 buddies) we get the following log messages (several hundred per second):

2014-02-10 21:28:58.338 [ConnectionOpenThread] ConnectionManager.serviceStarted() WARNING: bosh: Attempt to add different service with the same ID: null, type: accept, Socket: nullSocket[addr=/172.19.64.116,port=58954,localport=5280], jid: null

2014-02-10 21:28:58.338 [ConnectionOpenThread] ConnectionManager.serviceStarted() WARNING: bosh: Attempt to add different service with the same ID: null, type: accept, Socket: nullSocket[addr=/172.19.64.116,port=44566,localport=5280], jid: null

2014-02-10 21:28:58.339 [ConnectionOpenThread] ConnectionManager.serviceStarted() WARNING: bosh: Attempt to add different service with the same ID: null, type: accept, Socket: nullSocket[addr=/172.19.64.116,port=44515,localport=5280], jid: null

and

2014-02-10 21:29:00.603 [ConnectionOpenThread] ConnectionOpenThread.run() WARNING: Cant obtain socket channel from selection key, throttling activated = {0}, for port: {1}

2014-02-10 21:29:00.603 [ConnectionOpenThread] ConnectionOpenThread.run() WARNING: Cant obtain socket channel from selection key, throttling activated = {0}, for port: {1}

2014-02-10 21:29:00.603 [ConnectionOpenThread] ConnectionOpenThread.run() WARNING: Cant obtain socket channel from selection key, throttling activated = {0}, for port: {1}

Can you provide a little insight into what these mean?

Thanks,

Sean


Replies (20)

Added by Subir Jolly about 5 years ago

Just for some background, as you may also have noticed in init.properties: we do not register the users in our chat system's database at all. We just authenticate the user via an external API using a custom AuthRepository, and if the user is authenticated, we log the user in. Also, we are hitting the server directly instead of having a proxy between the server and the load-testing client(s).
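
(For readers following along: wiring a custom AuthRepository into a Tigase 5.x-era init.properties looks roughly like the sketch below; the class name and API URL are hypothetical placeholders, not the actual configuration from this thread:)

<pre>
# init.properties -- point authentication at a custom AuthRepository implementation
--auth-db = com.example.ExternalApiAuthRepository
--auth-db-uri = https://auth.example.com/api/check
</pre>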

I am also seeing a lot of:

2014-02-10 22:39:32.554 [pool-11-thread-13]  BoshSession.sendBody()           INFO:     No waitTimer for the Bosh connection! bosh@app1.dev.wa-chat/172.19.64.143_5280_172.19.64.116_40288, type: accept, Socket: bosh@app1.dev.wa-chat/172.19.64.143_5280_172.19.64.116_40288 Socket[addr=/172.19.64.116,port=40288,localport=5280], jid: null

Added by Artur Hefczyc TigaseTeam about 5 years ago

Please have a look at this post: http://www.tigase.org/content/perfomance-test and this: http://www.tigase.org/content/new-connections-throttling

Let me know if that helps.

Added by Subir Jolly about 5 years ago

Currently, Tigase is allotted 10G of memory. So, following those posts, I added these lines to init.properties:

--max-queue-size=100000

--new-connections-throttling = 5280:50000,5222:50000

Still having the same issue. I did not see this warning this time though:

WARNING:  Cant obtain socket channel from selection key, throttling activated = {0}, for port: {1}

Added by Artur Hefczyc TigaseTeam about 5 years ago

You do not need to adjust max-queue-size; the problem is not related to queue size. Besides, the queue size is adjusted automatically based on available memory.

Do you use any kind of custom logging framework or log formatter for Java? It is not printing log messages properly for you. In place of the {0} there should be either true or false, meaning whether the throttling was activated or not, and in place of {1} there should be a port number. This is important because it would give us a hint about the other possible causes of the error. It may not be throttling but something else, such as an OS-level misconfiguration (see the formatter sketch after the questions below). There are a few questions I need answered to be able to help you more:

  1. What is the max number of open files for a user under which Tigase is run?

  2. What OS do you use?

  3. What user account do you run Tigase from?

  4. What Tigase version do you use?

  5. What Java version and vendor do you use?

  6. Is the software simulating the user load running on the same machine as the Tigase server?

  7. If different, what is the port range set to on the client-side machines?
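
(To illustrate the formatter point above, here is a minimal java.util.logging sketch, not Tigase code, showing why unsubstituted {0}/{1} placeholders appear when a formatter reads the raw message instead of formatting it:)

<pre>
import java.util.logging.*;

public class PlaceholderDemo {
    public static void main(String[] args) {
        // Tigase logs MessageFormat-style placeholders plus a parameter array:
        LogRecord rec = new LogRecord(Level.WARNING,
                "throttling activated = {0}, for port: {1}");
        rec.setParameters(new Object[] { true, "5280" });

        // A correct formatter substitutes the parameters via formatMessage():
        System.out.println(new SimpleFormatter().formatMessage(rec));
        // -> throttling activated = true, for port: 5280

        // Reading the raw message skips substitution and leaves the literal
        // placeholders, which is exactly what the log excerpts above show:
        System.out.println(rec.getMessage());
        // -> throttling activated = {0}, for port: {1}
    }
}
</pre>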

Added by Sean Drummond about 5 years ago

We do not use any custom logging framework; we use java.util.logging.

1) 65536

2) CentOS release 6.4 (Final)

3) Tigase

4) tigase-server-5.2.0-RC1

5) jdk-1.7.0_40-fcs.x86_64 Oracle

6) No, we have several VMs that hit a proxy that then hits Tigase (we have also removed the proxy and get the same results)

7) High ports on the client, hitting port 80 on a proxy that forwards to 5280 on the Tigase server. If we hit Tigase directly, we use high ports to 5280.


Added by Artur Hefczyc TigaseTeam about 5 years ago

My suspicion is that you are running out of resources, either port numbers or file descriptors, on either the server side or the client side. Do you have any other error or warning log entries in the Tigase logs? Either tigase.log.0 or tigase-console.log.

What kind of VM do you use? Are all of the VMs on the same physical machine? I am asking because some virtualization frameworks work in such a way that they in fact share resources (port numbers, file descriptors) from the host system. Hence, under load tests you may quickly run out of certain resources.

Added by Sean Drummond about 5 years ago

We are not running out of FDs on the server side, and I highly doubt we are on the client side as we have 72 VMs that are able to spread the load. I have not spent much time in tigase-console.log; I will check it first thing tomorrow morning.

The VMs are spread across multiple physical servers. I will be double-checking the resources on those as well.


Added by Artur Hefczyc TigaseTeam about 5 years ago

Hm, after your first post I was under the impression that the problem starts at 25 online users, and that you get hundreds of error log entries per second. It seems unlikely that you use 72 VMs to generate load from about 25 online users. Could you please clarify those numbers? What kind of tests exactly are you running? Is Tigase running in cluster mode or on a single machine, what is the number of online users, what is the login-per-second rate during your test, the user presence status change rate, etc.?

If you run significant load tests, then 65,536 FDs is a very small number, especially if you run tests over the Bosh server. For Bosh, every single request (message, presence, any XMPP packet) between a client and the Bosh server is done over a new TCP/IP (HTTP) connection. So you may get hundreds or thousands of opened/closed TCP/IP (HTTP) connections per second on the server. This requires special OS configuration to make sure resources are quickly returned to the system and are available for reuse.
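
(A quick way to observe this churn on a Linux server is the kernel's socket summary; the timewait count climbs rapidly during a Bosh load test:)

<pre>
# print socket totals once per second; "timewait" shows closed
# connections whose resources have not yet been returned
watch -n 1 'ss -s'
</pre>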

Added by Sean Drummond about 5 years ago

Currently we have reduced our tests to include only one Tigase server. We are using 21 VMs running Tsung that hit the Tigase server directly on port 5280. Each Tsung server logs in 1 user every 3 seconds up to 100 users; the users also send a message every 5 seconds with a length of 16 bytes. We have upped the FD limit to 1,000,000 on Tigase and 65,000 on the Tsung servers. We have also added the following configs to Tigase:

--cm-traffic-throttling = xmpp:0:0:drop,bin:0:0:drop

--cm-ht-traffic-throttling = xmpp:0:0:drop,bin:0:0:drop

--new-connections-throttling = 5222:5000000,5280:5000000,5223:5000000,5277:5000000
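
(For reference, the FD increase mentioned above is typically made in /etc/security/limits.conf; a minimal sketch, assuming Tigase runs under a user named tigase:)

<pre>
# /etc/security/limits.conf -- raise open-file limits for the Tigase user
tigase  soft  nofile  1000000
tigase  hard  nofile  1000000
</pre>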

When starting the test we get this immediately:

2014-02-12 13:13:51.367 [pool-10-thread-73] BoshSession.sendBody() INFO: No waitTimer for the Bosh connection! bosh@app2.wa-chat.ord1s.net/192.168.64.98_5280_192.168.65.91_43481, type: accept, Socket: bosh@app2.wa-chat.ord1s.net/192.168.64.98_5280_192.168.65.91_43481 Socket[addr=/192.168.65.91,port=43481,localport=5280], jid: null

Eventually we get this as well:

[ConnectionOpenThread] ConnectionManager.serviceStarted() WARNING: bosh: Attempt to add different service with the same ID: null, type: accept, Socket: nullSocket[addr=/192.168.65.51,port=51146,localport=5280], jid: null


Added by Artur Hefczyc TigaseTeam about 5 years ago

Sean Drummond wrote:

Currently we have reduced our tests to include only one Tigase server. We are using 21 VMs running Tsung that hit the Tigase server directly on port 5280. Each Tsung server logs in 1 user every 3 seconds up to 100 users; the users also send a message every 5 seconds with a length of 16 bytes.

And you have roster size 50, right?

So, let's run some calculations:

  1. 21 user logins every 3 seconds gives us 7 logins per second; each login is approx 10 XMPP requests, so 70 requests per second from logins alone.

  2. Each user has 50 contacts to which the server sends their online presence; for 7 logins we have 7 * 50 = 350 presence packets per second from logins alone.

  3. 100 logged-in users from each of 21 VMs gives us 2,100 online users, and each sends a message every 5 seconds = 420 message packets per second from all online users.

  4. Total traffic is about 840 packets per second (70 + 350 + 420).

This is really not an issue, as 840 packets per second is considered low traffic, but in the case of Bosh it means 840 TCP/IP (HTTP) connections opened (and closed) per second. Depending on the OS configuration, a closed TCP/IP connection does not immediately return all its resources; it sits in a WAIT state for some time, depending on the settings in the OS. If this timeout is set to something like 60 seconds, you may have over 50,000 TCP/IP connections in the WAIT state (840 * 60 ≈ 50,400).
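
(You can verify this estimate directly on the server by counting sockets in the WAIT state on the Bosh port:)

<pre>
# count server-side sockets in TIME_WAIT on the Bosh port (5280)
netstat -ant | grep ':5280' | grep -c TIME_WAIT
</pre>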

We have upped the FD limit to 1,000,000 on Tigase and 65,000 on the Tsung servers. We have also added the following configs to Tigase:

--cm-traffic-throttling = xmpp:0:0:drop,bin:0:0:drop

--cm-ht-traffic-throttling = xmpp:0:0:drop,bin:0:0:drop

--new-connections-throttling = 5222:5000000,5280:5000000,5223:5000000,5277:5000000

When starting the test we get this immediately:

2014-02-12 13:13:51.367 [pool-10-thread-73] BoshSession.sendBody() INFO: No waitTimer for the Bosh connection! bosh@app2.wa-chat.ord1s.rsapps.net/192.168.64.98_5280_192.168.65.91_43481, type: accept, Socket: bosh@app2.wa-chat.ord1s.rsapps.net/192.168.64.98_5280_192.168.65.91_43481 Socket[addr=/192.168.65.91,port=43481,localport=5280], jid: null

Andrzej will look into the code and respond to this.

Eventually we get this as well:

[ConnectionOpenThread] ConnectionManager.serviceStarted() WARNING: bosh: Attempt to add different service with the same ID: null, type: accept, Socket: nullSocket[addr=/192.168.65.51,port=51146,localport=5280], jid: null

This is why I asked you about the port range set on the client-side machines. Even with only 100 users from each machine, a small port range may cause problems on the client-side machines for Bosh.

The warning message above means exactly this: "A new TCP/IP connection has been opened from an IP address and a port number that is already open to the server." The ID the message mentions is created from the source IP and PORT and the destination IP and PORT. The destination IP and PORT are always the same: this is the server-side endpoint.

As explained above, the rapid TCP/IP connection closing and opening happens on both the client machines and the server machine. However, on the server side there are many more connections being closed and opened, so this might happen with some delay compared to the client-side machines. Therefore, some connections which appear to be closed (and their ports available) on the client side may still be seen as open on the server side. The client side opens a new Bosh request on a new TCP/IP connection and reuses some PORT number which is still seen as open by the server. Hence the WARNING log entry on the server side.

You can reduce the occurrence of the WARNING by increasing the port range on the client machines.
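
(On Linux clients this is a one-line change; the range below is just an example:)

<pre>
# widen the ephemeral port range on the client (Tsung) machines
sysctl -w net.ipv4.ip_local_port_range="1024 64000"
</pre>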

Despite the WARNING, the whole event is handled correctly on the server side: the old connection is closed and the new connection is used as the active one. Perhaps the WARNING level is too high in this case.

Added by Andrzej Wójcik TigaseTeam about 5 years ago

The following INFO level entry in the log:

2014-02-12 13:13:51.367 [pool-10-thread-73] BoshSession.sendBody() INFO: No waitTimer for the Bosh connection! bosh@app2.wa-chat.ord1s.rsapps.net/192.168.64.98_5280_192.168.65.91_43481, type: accept, Socket: bosh@app2.wa-chat.ord1s.rsapps.net/192.168.64.98_5280_192.168.65.91_43481 Socket[addr=/192.168.65.91,port=43481,localport=5280], jid: null

may appear if the BOSH component tries to send a response on a newly created BOSH connection for which the waitTimer has not been set yet. This can happen and is safe to ignore, so I would say that the INFO level may be too high for this.

Added by Sean Drummond about 5 years ago

I increased the port range on the client side to "1024 64000", and after the 150th user is added I still get [ConnectionOpenThread] ConnectionManager.serviceStarted() WARNING: bosh: Attempt to add different service with the same ID: null, type: accept, Socket: nullSocket[addr=/192.168.65.51,port=51146,localport=5280], jid: null several hundred times per minute. Is there a way to resolve this on the Tigase side? This is with only one client adding 1 user per second up to 250 users.

Thank you,

Sean



Added by Artur Hefczyc TigaseTeam about 5 years ago

Just so you understand, I have to make a remark: all our suggestions here are just theories based on the limited information we have about your system, configuration, and tests. All we can do here is say: try this, try that. Without a proper investigation we are guessing based on our experience and knowledge.

The best way to solve the problem is to properly investigate it, find the source of the problem, and fix it.

Otherwise we just have new suggestions: try this, try that.

What I would do on the Tigase server side right now is adjust OS-level (Linux kernel) settings for the TCP/IP protocol. Through sysctl.conf you can change various timeouts for the TCP/IP protocol. I would start with the WAIT state timeout. There are various others related to keeping connections alive, testing whether they are broken, etc. Default Linux settings are prepared more for HTTP traffic, which is very different from XMPP, and especially different from Bosh. I worked with a customer a few years ago who also had only Bosh connections on a very large scale. It took us a while to get the settings right, but eventually we had it working quite well.
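
(A starting-point sketch of the kind of sysctl.conf tuning meant here; the values are illustrative assumptions, not recommendations from this thread:)

<pre>
# /etc/sysctl.conf -- example values only, tune for your environment
net.ipv4.tcp_fin_timeout = 15          # shorten how long closing sockets linger (default 60s)
net.ipv4.tcp_tw_reuse = 1              # allow reuse of TIME_WAIT sockets for new outgoing connections
net.ipv4.ip_local_port_range = 1024 65000   # widen the ephemeral port range
net.core.somaxconn = 4096              # larger accept backlog for connection bursts
</pre>

Apply the settings with sysctl -p.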

This should help.

Added by Sean Drummond about 5 years ago

We are starting to veer away from using Bosh altogether because of these issues. We have monitoring set up, and when looking at TigaseConnectionBoshCount while ramping up our tests, it quickly gets up to about 20,000 with only 300 users online at that point. We are logging in a total of 5 users per second from a collection of 24 Tsung servers. Then we receive:

ConnectionManager.serviceStarted() WARNING: bosh: Attempt to add different service with the same ID:

And then the Tigase server becomes unresponsive while TigaseConnectionBoshCount continues to grow to about 90,000.

If you could let us know what it took to get that previous customer's setup to use Bosh at such a high scale, it may help us; otherwise we may just default to XMPP only.


Added by Artur Hefczyc TigaseTeam about 5 years ago

I do not have any details from the other customer's installation. I am pretty sure they do not have that much traffic over Bosh, though. Certainly Bosh is not suitable for high-traffic installations; you can see for yourself that the way Bosh works just kills the server.

A few ways you could mitigate the problem are:

  1. Use HTTP keep-alive for Bosh connections; this way you can have multiple Bosh requests on a single HTTP (TCP/IP) connection. The result is close to standard XMPP.

  2. Use WebSockets when possible; the same result as above, close to a standard XMPP connection.

We have customers with logic which attempts to connect via WebSockets if possible, then tries HTTP keep-alive, and falls back to standard Bosh as the last option. If only a small percentage of users are on the least efficient plain Bosh, then this does not affect the service in a significant way.

Added by Sean Drummond about 5 years ago

That is one option we are discussing. However, during testing, certain scenarios using Bosh brought down Tigase and it no longer responded. I am worried that if we allow Bosh at all we will find ourselves in this predicament.


Added by Artur Hefczyc TigaseTeam about 5 years ago

I wonder why it stopped responding. Was it an OOM crash, running out of file descriptors, or internal queues filling up? Do you have any more information on this?

Of course, if you can avoid Bosh and go with standard XMPP connections, this is highly recommended for many reasons. However, if for some reason Bosh is necessary in certain cases, we could always implement additional limitations for Bosh connections. Similarly to connection and traffic throttling, we could add something which would put limits on Bosh to reduce the service impact.

Added by Sean Drummond about 5 years ago

It seems Tigase thinks it is using a ton of connections when in reality it is not. Shortly into my tests, Tigase monitoring is reporting:

While the server is reporting:
<pre>
[root@app2 ~]# netstat -a|wc -l
2289
[root@app2 ~]# lsof |grep tigase|wc -l
3060
[root@app2 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         72492      13721      58770          0        535       2140
</pre>

When trying to connect via Adium I get "Error: Unable to establish a connection with the server" and "Bad file descriptor."


Added by Artur Hefczyc TigaseTeam about 5 years ago

Sean Drummond wrote:

It seems Tigase thinks it is using a ton of connections when in reality it is not. Shortly into my tests, Tigase monitoring is reporting:

How do you collect this information? It is possible that this number is updated with some delay, but I am surprised the numbers are so different, unless there is a bug in Tigase and either the number is not updated or the Bosh session in Tigase is not destroyed correctly in some cases. Also, there is the Bosh inactivity timeout, which affects how long Tigase keeps Bosh session data in memory even if the user is no longer connected to the server. Another thing is that even for still-connected users, Tigase keeps old connections open because it is always ready to reuse them in keep-alive mode if the Bosh client decides to send more data over an old connection.

While the server is reporting:

Again, it is hard to tell for sure without investigating; I am just guessing here. I remember similar problems with getting consistent numbers when testing Bosh while a significant number of TCP/IP connections were in the WAIT state. Also, what kind of Java do you use there? What Linux kernel? Have you changed any Linux kernel settings for TCP/IP?

There are really lots of elements to verify. I can send you suggestion after suggestion and we can probably get even a 100% improvement, but in your case that does not seem good enough. I doubt you can get any dramatic results for Bosh under high traffic anyway.

[...]

When trying to connect via adium I get "Error: Unable to establish a connection with the server" and "Bad file descriptor."
