Intermittent BOSH connection request failures

Anonymous
Added over 5 years ago

I am trying to log in XMPP users from an external application (let's call it PS) using BOSH on the Tigase server. PS uses the standard BOSH httpbind mechanism to connect to Tigase. The application has to support logging in a number of users on Tigase within any given second. During my initial testing, I am finding that the Tigase server does not respond to all BOSH connections.

For example, I have 6 connections to establish from PS to Tigase, and I can see all 6 bind requests being received in the Tigase server logs. But only 4 of them are responded to; the other two time out, and PS never receives a response from Tigase. Note that all 6 bind requests hit Tigase within the very same second. The behavior is not consistent either: sometimes only 2 connections are accepted, sometimes 3. Not even once have I seen all 6 connections established.

Could this be because multiple BOSH connection requests are coming through sockets with the same source IP? For example, the connection tuples look like 101.56.10.53_5288_101.56.10.53_63742, 101.10.10.53_5288_101.10.10.53_63741, where 101.56.10.53 is the IP of Tigase and 5288 is a dedicated BOSH port for the PS application.

I am trying to debug the issue by turning on logging on the 'net' and 'io' packages.

I am not sure if this is a bug or something that can be handled with configuration or in some other way.

Has anyone seen this behavior before or can someone advise me as to where the issue could be or what to look for in the logs?

Thanks a lot.


Replies (6)

Added by Artur Hefczyc TigaseTeam over 5 years ago

Tigase was tested for 500 - 1000 user logins per second. It can certainly handle as much as the underlying database can handle. There are production systems which are under a load of hundreds of logins per second.

6 logins a second is not an issue at all. At least not on the server side. What Tigase server version do you use?

The problem with Bosh, however, is that each user request is run on a different TCP/IP (HTTP) connection. Tigase responds on the oldest opened connection from a client. Timing and the order of events are critical. If the client does not follow the spec, the connection will fail. It should be possible to track this down from the server logs: what happened, and what was missing from the client so that the server could not respond and complete the user login. A Bosh session has a unique ID (SID) which persists over multiple TCP/IP (HTTP) connections. This SID is printed out in the server log file. You should be able to track down a broken Bosh session by the SID and find out what happened.
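As a rough illustration of tracking a session by its SID, here is a minimal Python sketch. It assumes the log has already been read into a list of lines and that the SID appears verbatim in every relevant entry. The function name, the sample entries, and the SID values are made up for the example and are not real Tigase log output.

```python
def lines_for_sid(log_lines, sid):
    """Return, in original order, the log lines that mention the given SID."""
    return [line for line in log_lines if sid in line]


# Hypothetical log entries; the real log format will differ.
sample_log = [
    "conn=A sid=abc123 session created",
    "conn=B sid=def456 session created",
    "conn=C sid=abc123 request received",
]

for entry in lines_for_sid(sample_log, "abc123"):
    print(entry)
```

Reading the matching entries in order should show which request in the broken session never received a response.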

Added by Anonymous over 5 years ago

From my application, I open a single BOSH Connection/Session per user and I can see the body element being printed out in the logs. The initial httpbind body hits the server for all 6 users. But only some of them get a sid generated and responded to. So the trouble is even getting the BOSH Session established.

I totally agree that the capacity of Tigase is not a limiting factor here, as you say. I understand that it can certainly support thousands of concurrent connections.

We are using Tigase server version 5.1; I am not sure about the build number. Unfortunately, upgrading is not an option at this point unless there is something which would particularly help with this error.

This is the body which is sent; each user's request obviously has a different request ID.
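For readers unfamiliar with BOSH, the following Python sketch builds a generic session creation request of the kind described in XEP-0124. It is not the exact body my application sends. The function name, the domain, and the values for `wait`, `hold`, and `ver` are placeholders chosen for illustration.

```python
import random
from xml.sax.saxutils import quoteattr

BOSH_NS = "http://jabber.org/protocol/httpbind"


def session_request(domain, rid=None, wait=60, hold=1):
    """Build a generic BOSH session creation <body/> (illustrative values)."""
    if rid is None:
        # Each session starts from its own random request ID.
        rid = random.randrange(1, 2**31)
    attrs = {
        "rid": str(rid),
        "to": domain,
        "wait": str(wait),
        "hold": str(hold),
        "ver": "1.6",
        "xml:lang": "en",
        "content": "text/xml; charset=utf-8",
        "xmlns": BOSH_NS,
        "xmlns:xmpp": "urn:xmpp:xbosh",
        "xmpp:version": "1.0",
    }
    rendered = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs.items())
    return f"<body {rendered}/>"


# "example.com" is a placeholder domain.
print(session_request("example.com"))
```

The server's reply to such a request is where the `sid` for the new session is supposed to come back.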

I am tracing the .net and .io classes in the logs, particularly SimpleParser, XMPPDomBuilderHandler, XMPPIOService, etc. One place where I see a difference between the requests that work and the one that does not is that endElement() on the parser does not get executed for the request that breaks. I see that SimpleParser is a singleton instance in the server and it is supposed to be thread safe.

I am attaching the trace for a connection request that works (10.56.10.53_5288_10.56.10.53_53192) and one that does not work (10.56.10.53_5288_10.56.10.53_53191).

If you compare them in a file diff tool, you will see a difference after the startElement() call.
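In case a diff tool is not at hand, the comparison can be approximated with Python's `difflib`. This is only a sketch: the function name and the sample lines are invented, and it assumes each trace line contains the connection tuple verbatim, so the tuple is replaced with a common marker before diffing.

```python
import difflib


def diff_traces(good_lines, bad_lines, good_id, bad_id):
    """Diff two traces after masking their connection IDs with a common marker."""
    norm_good = [line.replace(good_id, "<conn>") for line in good_lines]
    norm_bad = [line.replace(bad_id, "<conn>") for line in bad_lines]
    return list(
        difflib.unified_diff(norm_good, norm_bad, "works", "fails", lineterm="")
    )


# Hypothetical trace lines; only the method names come from the real trace.
good = [
    "10.56.10.53_5288_10.56.10.53_53192 startElement()",
    "10.56.10.53_5288_10.56.10.53_53192 endElement()",
]
bad = ["10.56.10.53_5288_10.56.10.53_53191 startElement()"]

for line in diff_traces(
    good,
    bad,
    "10.56.10.53_5288_10.56.10.53_53192",
    "10.56.10.53_5288_10.56.10.53_53191",
):
    print(line)
```

Masking the connection tuple first keeps the diff focused on the calls that differ rather than on the port numbers.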

The issue is quite arbitrary and there is no pattern to it.

Thanks a lot for the help.

Added by Artur Hefczyc TigaseTeam over 5 years ago

Unfortunately Tigase does not print the exact data it received from the network. So we do not know what was received. The number of bytes/characters is different so there might be something different. Could you install the last beta of 5.2.0 just for the sake of having more complete log information? The last version should give us more details.

Added by Anonymous over 5 years ago

Please look at the two attached log files. They were for different users, but both tried connecting at the same time. And Tigase does print the data that it received, in the IOService class(?). So I am not sure if we are really missing any data.

Did you take a look at the log file, any clues at all? I am logging everything from the net, io, xml packages.

Are you suggesting that something is getting lost or messed up in the network? If so, all 6 connections in a given test should fail at least once, right?

I consistently see some random number out of 6 being successful.

Unfortunately, upgrading to 5.2.0 is not an option at this point, as my application is only an external entity and I do not have control over the production version of Tigase that is being deployed.

Added by Anonymous over 5 years ago

I have figured out that the problem is with the xlightweb library, which I was using from my application to create the BOSH connections and do the other BOSH interactions. The issue was in the User-Agent HTTP header: the version of xlightweb I was using was putting "XLightweb/" in it on certain requests. I found this after closely examining the packets that Tigase was printing out in the logs. I had to add a lot more log statements and turn on FINEST logging on the net and io packages to see it.
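For anyone who wants to catch this kind of header before it goes out, here is a minimal Python sketch that flags a User-Agent containing a product with an empty version, such as "XLightweb/". It follows the HTTP `product = token ["/" product-version]` grammar only loosely (nested comments are not handled), the function name is my own, and the version number in the usage lines is a placeholder. I did not verify exactly why the server stalled on that value.

```python
import re

# Characters allowed in an HTTP token ("tchar" in the HTTP grammar).
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_PRODUCT = re.compile(rf"{_TOKEN}(?:/{_TOKEN})?")


def has_wellformed_products(user_agent):
    """True if every product in the header is 'name' or 'name/version'."""
    # Drop simple, non-nested parenthesised comments before checking.
    without_comments = re.sub(r"\([^()]*\)", " ", user_agent)
    parts = without_comments.split()
    return bool(parts) and all(_PRODUCT.fullmatch(part) for part in parts)


print(has_wellformed_products("XLightweb/"))      # the value seen in the logs
print(has_wellformed_products("XLightweb/2.13"))  # placeholder version number
```

A check like this could run in the client's own request logging, which would have pointed at the library much sooner.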

I upgraded the xlightweb library and now the issue seems to have gone away.

Sorry about the false alarm. This, as of now, definitely does not seem to be a Tigase problem.

Added by Artur Hefczyc TigaseTeam over 5 years ago

Thank you for the update. I am glad it has been sorted out. I am sorry I could not get back to you earlier.
