Project

General

Profile

There is no session with given SID.

Robert Larsen
Added almost 5 years ago

Hi

I am testing Tigase in production by having a script (using Strophe.js) running in the background in our users browsers (10% of them). It connects to Tigase, joins a room and sends messages every now and then.

However, I get a lot of "bad session" entries in the log:

# wc -l tigase.log.0 && (grep 'no session with given SID' tigase.log.0 | wc -l)
38783 tigase.log.0
32738

You can see the test script here: http://www.komogvind.dk/javascript/tigase_test.js

We use tigase-server-5.2.1-b3461

There are valid connections to the server, so it is not a general problem but with so many log entries it is very noticable.

What can I do to debug this?


Replies (20)

Added by Robert Larsen almost 5 years ago

I don't know if it's related, but with around 900 simultanious users the Tigase process generates 240% load.

The machine has six cores with four threads each and they run at 2GHz, so it can take it, but that few users should not generate that load.

I will try attaching a profiler later.

Added by Robert Larsen almost 5 years ago

A little something to illustrate the rate at which bad session id's come in:

# while true; do echo $(date)": "$(grep 'no session with given SID' logs/tigase.log.0 | wc -l); sleep 1; done
Thu Jun 19 14:35:40 CEST 2014: 30689
Thu Jun 19 14:35:41 CEST 2014: 30912
Thu Jun 19 14:35:42 CEST 2014: 31135
Thu Jun 19 14:35:43 CEST 2014: 31360
Thu Jun 19 14:35:44 CEST 2014: 31582
Thu Jun 19 14:35:45 CEST 2014: 31790
Thu Jun 19 14:35:46 CEST 2014: 32023
Thu Jun 19 14:35:47 CEST 2014: 32236
Thu Jun 19 14:35:48 CEST 2014: 32459
Thu Jun 19 14:35:49 CEST 2014: 32653

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

Could you track back SIDs that generate the invalid SID error? did they follow and match timings of bosh sessions and bosh connections?

If you are willing to try one of the latest snapshots it includes handy bosh logger which creates separate file only with information regarding bosh sessions/connections which makes the tracking easier; it can be enabled by following property in init.properties:

bosh/sid-logger-level[S]=FINEST

Added by Robert Larsen almost 5 years ago

Could you track back SIDs that generate the invalid SID error? did they follow and match timings of bosh sessions and bosh connections?

It will take some tcpdumping but I can do that.

However, I use Strophe.js on the client which has always worked beautifully, so I doubt it is the one causing issues.

If you are willing to try one of the latest snapshots it includes handy bosh logger which creates separate file only with information regarding bosh sessions/connections which makes the tracking easier;

Cool, I tried the latest source (Tigase ver. 5.3.0-SNAPSHOT-b3527/800c2460 (2014-06-19/15:51:25)) and that config property, but I don't see any new files in the logs directory.

Should a path be specified too?

Added by Wojciech Kapcia TigaseTeam almost 5 years ago

No, output should be in @logs/bosh_sid.log@.

I tried the latest source (Tigase ver. 5.3.0-SNAPSHOT-b3527/800c2460 (2014-06-19/15:51:25))

I assume you build it by yourself? could you try origin/devel branch?

Added by Robert Larsen almost 5 years ago

I assume you build it by yourself? could you try origin/devel branch?

I built it myself yes. I tried the devel branch but still no extra log files thou.

The tigase.log.0 file says that the sid-logger-level config was picked up:

logs/tigase.log.0:2014-06-20 14:13:16.734 [main]             ConfiguratorAbstract.itemUpdated()  INFO:    Updating configuration item: bosh/sid-logger-level=FINEST

Still lots of bad SIDs:

# last=$(grep 'no session with given SID' logs/tigase.log.0 | wc -l);while true; do sleep 1; cur=$(grep 'no session with given SID' logs/tigase.log.0 | wc -l); echo $(date) $((cur-last)); last=$cur; done
Fri Jun 20 14:19:00 CEST 2014 108
Fri Jun 20 14:19:01 CEST 2014 98
Fri Jun 20 14:19:02 CEST 2014 102
Fri Jun 20 14:19:03 CEST 2014 113
Avatar?id=6023&size=32x32

Added by Artur Hefczyc TigaseTeam almost 5 years ago

Have you added the config line provided above? The new version will not change any Bosh behavior but the additional SID log will give you additional details to investigates the problem.

Added by Robert Larsen over 4 years ago

Apparently my git-fu sucks. I had changed to the developer branch but pulled badly, now I get the log file.

I get lots of lines like this:

2014-06-24 13:00:38.733 [pool-13-thread-15]  BoshConnectionManager.processSocketData()  FINE: e019d5e3-85d0-417a-b5b0-536de3435630 : INVALID_SID (Invalid SID)
2014-06-24 13:00:38.741 [pool-13-thread-60]  BoshConnectionManager.processSocketData()  FINE: ed105225-5bed-44c4-85cf-c57be9cfbb2c : INVALID_SID (Invalid SID)
2014-06-24 13:00:38.750 [pool-13-thread-33]  BoshConnectionManager.processSocketData()  FINE: 36d92142-2737-4678-819c-a6e72c84543d : INVALID_SID (Invalid SID)
2014-06-24 13:00:38.764 [pool-13-thread-86]  BoshConnectionManager.processSocketData()  FINE: 97aa521c-a851-4cbb-a95d-f181ba224120 : INVALID_SID (Invalid SID)
2014-06-24 13:00:38.776 [pool-13-thread-40]  BoshConnectionManager.processSocketData()  FINE: 7a4eb770-b3d1-4129-9efc-ab191091d303 : INVALID_SID (Invalid SID)
2014-06-24 13:00:38.778 [pool-13-thread-41]  BoshConnectionManager.processSocketData()  FINE: f217cca8-8b10-4fcf-aa5b-9dbdd616d73c : INVALID_SID (Invalid SID)

..and this:

2014-06-24 13:02:48.468 [pool-13-thread-71]  BoshSession.disconnected()       FINEST:   2dcef8de-a8f9-4a4b-a230-0c8023eed37b : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_80.198.94.32_26271)
2014-06-24 13:02:48.468 [pool-13-thread-71]  BoshSession.disconnected()       FINEST:   2dcef8de-a8f9-4a4b-a230-0c8023eed37b : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_80.198.94.32_26271)
2014-06-24 13:02:48.477 [pool-13-thread-20]  BoshSession.disconnected()       FINEST:   04440fdd-67b8-4fa5-a353-76005b4de223 : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_109.198.128.74_53430)
2014-06-24 13:02:48.477 [pool-13-thread-20]  BoshSession.disconnected()       FINEST:   04440fdd-67b8-4fa5-a353-76005b4de223 : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_109.198.128.74_53430)
2014-06-24 13:02:48.570 [pool-13-thread-94]  BoshSession.disconnected()       FINEST:   2dcef8de-a8f9-4a4b-a230-0c8023eed37b : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_80.198.94.32_26271)
2014-06-24 13:02:48.570 [pool-13-thread-94]  BoshSession.disconnected()       FINEST:   2dcef8de-a8f9-4a4b-a230-0c8023eed37b : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_80.198.94.32_26271)

I have tried the same clientside test code against OpenFire and ejabberd where it works fine, and I cannot reproduce this in a test in the office on the developers network (far fewer clients).

Added by Robert Larsen over 4 years ago

The timeouts may be fine since this is a web client so the connection is dropped when the client changes page, but the INVALID_SID thing I think is bad.

They come in at nearly 200 lines per second, but it is a few clients that make the bulk of it.

Time frame:

# head -n 1 bosh_sid.log && tail -n 1 bosh_sid.log
2014-06-24 13:48:03.534 [pool-13-thread-2]  BoshConnectionManager.processSocketData()  FINE: e1e72eef-2c2d-494d-a422-245db81247e7 : CREATE (Socket bosh session)
2014-06-24 13:48:18.386 [pool-13-thread-39]  BoshConnectionManager.processSocketData()  FINE: 36d92142-2737-4678-819c-a6e72c84543d : INVALID_SID (Invalid SID)

Number of unique SIDs:

# grep INVALID_SID bosh_sid.log |awk '{print $6}'|sort|uniq|wc -l
39

Sorted by number of occurences:

# grep INVALID_SID bosh_sid.log |awk '{print $6}'|sort|uniq -c|sort -n
      1 16a87915-ae58-4836-897b-55113ec80919
      1 2164f5e4-8666-4bf2-a06d-e8b8ce36c208
      1 30d1d923-7754-4ee7-b4cb-47464a858f7b
      1 3abd6636-8f1b-46f7-aaa5-a167336f3944
      1 486d4c60-f874-4d38-b187-817082b94cdd
      1 56f79b15-f669-4882-9c66-e8362bb9d638
      1 6f6242ab-076c-47b5-add5-1b8db637e085
      1 ae63f51c-64b8-47b8-8232-c02302c86f85
      1 b3975662-da37-4bda-8243-89e3c744bf14
      1 b74ef267-0e59-48f5-b789-8f7013afccb1
      1 bf98cc1f-9ece-474f-a6cc-c4e676b73cad
      1 f825d53f-fdd0-47f3-b8bc-ad75af88e6f8
      8 null
     15 cdf6e68a-e9e5-4b00-ac48-96697040f18c
     16 1fe0edf0-2b89-4021-814a-a471de10b535
     16 310f1ba1-9dfe-4d4d-8963-8466ea5e537d
     16 7cefebcf-f332-4fe1-a814-6db990e875bc
     29 e019d5e3-85d0-417a-b5b0-536de3435630
     33 d1facb55-f92d-481c-b575-228eaedf717c
     53 22014127-b15d-4241-8297-366e848966fd
     53 e705713d-3e17-4b12-8ea5-22cff52afdd1
     54 4254d879-66be-4e19-bcd3-3f7bcbffde13
     72 1c361265-622b-469e-9a14-613b2821cc07
     74 14a919b9-3857-4fb0-bd68-eed2112a1fe0
     83 219bac18-d337-4831-9815-0d7cea52b923
     92 07e04524-1399-4ea4-8e82-73483f8d4a8b
    122 367f07b6-bd5f-419a-8294-4f97ae62d1a5
    125 f11884cb-85ab-4005-b8a5-e168f44050ac
    127 ed105225-5bed-44c4-85cf-c57be9cfbb2c
    129 25882ae5-4d4a-4595-a506-2bc7f3c8fc94
    134 7f03faf5-4107-49ab-ad4b-42a3e365d0a1
    135 8070bcca-1468-4a36-91cf-aa8345874143
    137 9b5ac5a0-0d16-451c-921a-2585defa0c5f
    141 121c2524-34bb-4bd6-a5b8-98c832121c96
    147 36d92142-2737-4678-819c-a6e72c84543d
    147 764148e7-e834-4813-8acd-583deb464813
    148 97aa521c-a851-4cbb-a95d-f181ba224120
    148 b74238ca-ad41-497f-b0c1-c69269988288
    149 c3dd3a8a-2afc-4dc6-b213-59826a1a6aa5

I have a pcap of the above timeframe (give or take a few seconds) and I have separated out the guy with the 149 occurences. I see lots of these per second:

Request:

POST / HTTP/1.1
Accept: */*
Origin: http://www.komogvind.dk
Accept-Language: da
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)
Host: chat.cego.dk:5281
Content-Length: 111
Connection: Keep-Alive
Cache-Control: no-cache

<body rid='1539809852' xmlns='http://jabber.org/protocol/httpbind' sid='c3dd3a8a-2afc-4dc6-b213-59826a1a6aa5'/>

Response:

HTTP/1.1 404 Invalid SID
Content-Type: text/xml; charset=utf-8
Content-Length: 253
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, OPTIONS
Access-Control-Allow-Headers: Content-Type
Access-Control-Max-Age: 86400
Connection: close
Server: Tigase Bosh/5.3.0-SNAPSHOT-b3596/9d863722 (2014-06-24/12:53:23)

<body xmlns="http://jabber.org/protocol/httpbind" type="error"><error type="cancel" code="404"><item-not-found xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/><text xml:lang="en" xmlns="urn:ietf:params:xml:ns:xmpp-stanzas">Invalid SID</text></error></body>

The 'rid' field is incremented by each request, but the rest is left untouched.

Not being an expert on XMPP or BOSH, does that look right?

Should StropheJS back off?

Added by Wojciech Kapcia TigaseTeam over 4 years ago

How to proceed - if you have enabled the sid-logging then after having first occurance of Invalid SID please filter out sid log for this particular SID and observe if the connections are correctly established and then timers reset (yes, those are correct and inherent to the bosh because of it's request-response nature) and then try to track down. Most likely after a series of timer events (setting-cancelling) there should be entry that because of long inactivity bosh session is closed (client failed to make subsequent request in given time) and then all subsequent request, even though RID is incremented correctly, would result in Invalid SID as the session would got scrapped.

Added by Robert Larsen over 4 years ago

I did this over about 18 minutes and 22 seconds:

# grep  d4ec1c1d-7707-4303-b915-0dad9a4a8845 bosh_sid.log|head -n 1 && grep d4ec1c1d-7707-4303-b915-0dad9a4a8845 bosh_sid.log | tail -n 1
2014-06-25 13:51:56.503 [pool-13-thread-43]  BoshConnectionManager.processSocketData()  FINE: d4ec1c1d-7707-4303-b915-0dad9a4a8845 : INVALID_SID (Invalid SID)
2014-06-25 14:10:18.233 [pool-13-thread-14]  BoshConnectionManager.processSocketData()  FINE: d4ec1c1d-7707-4303-b915-0dad9a4a8845 : INVALID_SID (Invalid SID)

No timeout in over 18 minutes, and many requests made:

# grep  d4ec1c1d-7707-4303-b915-0dad9a4a8845 bosh_sid.log|grep INVALID_SID|wc -l
6382

...or close to six requests per second.

It seems to me that this issue is with StropeJS.

Added by Wojciech Kapcia TigaseTeam over 4 years ago

Robert Larsen wrote:

No timeout in over 18 minutes, and many requests made:

But there is not even session creation information at all. Looking at INVALID_SID won't tell you much as the session is already long gone. You have to track down to the moment before INVALID_SID started to show for particular SID.

Added by Robert Larsen over 4 years ago

Alrighty...I have this one:

2014-06-25 15:40:32.682 [pool-13-thread-29]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : CREATE (Socket bosh session)
2014-06-25 15:40:32.682 [pool-13-thread-29]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:34.531 [pool-13-thread-62]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:34.531 [pool-13-thread-62]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:34.531 [pool-13-thread-62]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:34.531 [pool-13-thread-62]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:37.512 [pool-13-thread-33]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:37.512 [pool-13-thread-33]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:37.512 [pool-13-thread-33]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:37.564 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:37.564 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:39.498 [pool-13-thread-85]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:39.498 [pool-13-thread-85]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:39.498 [pool-13-thread-85]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:39.498 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:39.498 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:40.196 [pool-13-thread-40]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:40.196 [pool-13-thread-40]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:40.196 [pool-13-thread-40]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:40.200 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:40.201 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.122 [pool-13-thread-15]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.122 [pool-13-thread-15]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.122 [pool-13-thread-15]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.123 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.123 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.365 [pool-13-thread-20]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.365 [pool-13-thread-20]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.365 [pool-13-thread-20]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.369 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.369 [in_8-bosh]        BoshSession.disconnected()         FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.586 [pool-13-thread-41]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.586 [pool-13-thread-41]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:42.586 [pool-13-thread-41]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:53.095 [pool-13-thread-16]  BoshConnectionManager.serviceStopped()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : REMOVE (Closing bosh session)
2014-06-25 15:40:53.095 [pool-13-thread-16]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:53.095 [pool-13-thread-16]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49634)
2014-06-25 15:40:57.757 [pool-13-thread-75]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49661)
2014-06-25 15:40:57.757 [pool-13-thread-75]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49661)
2014-06-25 15:41:07.905 [pool-13-thread-79]  BoshConnectionManager.serviceStopped()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : REMOVE (Closing bosh session)
2014-06-25 15:41:07.905 [pool-13-thread-79]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49661)
2014-06-25 15:41:07.905 [pool-13-thread-79]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49661)
2014-06-25 15:41:16.053 [pool-13-thread-78]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49677)
2014-06-25 15:41:16.053 [pool-13-thread-78]  BoshSession.processSocketPacket()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Scheduling waitTimer: 10.0.1.9_5281_67.248.249.157_49677)
2014-06-25 15:41:26.545 [pool-13-thread-61]  BoshConnectionManager.serviceStopped()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : REMOVE (Closing bosh session)
2014-06-25 15:41:26.545 [pool-13-thread-61]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Canceling inactivityTimer: 10.0.1.9_5281_67.248.249.157_49677)
2014-06-25 15:41:26.545 [pool-13-thread-61]  BoshSession.disconnected()       FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (Setting inactivityTimer for 10 on: 10.0.1.9_5281_67.248.249.157_49677)
2014-06-25 15:41:36.545 [scheduler_pool-3-thread-1-bosh]  BoshSession.task()  FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : TIMER (inactivityTimer fired)
2014-06-25 15:41:36.545 [scheduler_pool-3-thread-1-bosh]  BoshSession.task()  FINEST:   7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : REMOVE (Closing session, inactivity timeout expired)
2014-06-25 15:41:36.545 [scheduler_pool-3-thread-1-bosh]  BoshConnectionManager.addOutStreamClosed()  FINEST: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : REMOVE (Closing bosh session)
2014-06-25 15:41:53.710 [pool-13-thread-35]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:27.065 [pool-13-thread-75]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:27.417 [pool-13-thread-13]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:27.762 [pool-13-thread-84]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:28.153 [pool-13-thread-49]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:28.540 [pool-13-thread-25]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:28.899 [pool-13-thread-21]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:29.247 [pool-13-thread-68]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:29.554 [pool-13-thread-69]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:30.047 [pool-13-thread-95]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:30.431 [pool-13-thread-90]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID) 
2014-06-25 15:42:30.885 [pool-13-thread-10]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID) 
2014-06-25 15:42:31.170 [pool-13-thread-6]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID)
2014-06-25 15:42:31.551 [pool-13-thread-40]  BoshConnectionManager.processSocketData()  FINE: 7a1e9800-aaf2-4e4e-b9d1-ae3f09c4baff : INVALID_SID (Invalid SID) 
.....

There are 4765 INVALID_SID lines spanning 26 minutes and 12 seconds...or three invalid sid requests per second.

Of course I didn't make a tcp dump of the above (I will recreate this), so I don't know if the inactivity timer was correct in cleaning up, but the client should not be able to guess the sid, so it is likely the same client with the same session.

Added by Robert Larsen over 4 years ago

Ok, now I have something solid.

Try downloading this: http://www.cego.dk/sinner.zip

It contains a log file and a pcap.

The client sends several requests that Tigase does not respond to. The first stream is the initial BOSH handshaking. The next two streams are these:

<body rid='3846046447' xmlns='http://jabber.org/protocol/httpbind' sid='eb072392-cc88-44ea-87a9-47b2d60332c8'/>

Keepalives?

Tigase does not respond to them.

The fourth stream is the same request but this time Tigase responds with "invalid sid".

So, something prevents Tigase from responding to the requests and thus does not update the "last seen" timestamp causing it to time out. But why?

Also...this does not happen to every client.

Added by Wojciech Kapcia TigaseTeam over 4 years ago

OK, by default Tigase handles 2 concurrent requests, you can configure the number using following property:

bosh/concurrent-requests[I]=8

Added by Robert Larsen over 4 years ago

Increasing number of concurrent requests to eight didn't help. It takes between 15 and 30 minutes before we see the first problems with new connections (ones where we also have the connection creation in the log).

I have enabled WebSockets in the client code to see if this is a problem there too. About 7½% of our users don't have WebSocket support so this is not a solution but more like an extended test.

StropheJS should step down when seing an invalid sid message so there is definitely a problem there (most of the times it does but not always) so I have filed a bug report with them: https://github.com/strophe/strophejs/issues/76

But getting to the invalid sid in the first place is a problem too. The server is running on a large and powerful machine which is heavily underutilized both on CPU, RAM, disk and network and only around 1000 concurrent users are running the test so I don't believe this to be a resource problem.

Avatar?id=6023&size=32x32

Added by Artur Hefczyc TigaseTeam over 4 years ago

Robert, Just for the record. "Invalid SID" error is more or less normal and expected thing, especially when you connect real web client from real web browsers. This happens due to many factors: network is never super reliable, web browsers are not especially good at maintaining long lasting with no activity HTTP connections, browser is busy and does not reconnect within a timeout frame, there are proxies and firewalls in between which may also want to kill long lasting connections and some other things tend to happen.

Therefore there is no way you can completely make it disappear.

What you should focus on actually is to check how frequent this problem is, if it happens for, let's say 0.1% of user connections, than, maybe it is acceptable and not worth wasting time for investigation. If it happens for 50% connections than it might be serious. Another thing is, worth considering, whether the client handles this well. As this is something which just happens and there is not that much we can do, at least the client should handle it correctly. What means - handling correctly depends on your system and requirements.

Using Websockets would solve the problem for over 90% of your users which is a good sign. If the problem persist for 1% of remaining 10% of your users, it means there would be only 0.1% users affected. Might be acceptable.

I also suggest you learn to read the Tigase log files, especially bosh_sid.log*. It will allow you to investigate why this actually happen. In your attached log excerpt it is clear that the inactivity timeout expired, so Tigase has cancelled Bosh session for the SID. All subsequent client's connections we rejected. Maybe increasing *inactivity timeout would be a solution? If you say that it works better with open fire and ejabberd then maybe these other servers have a different inactivity timeout settings? On your Tigase installation it is set to 10 seconds which might be too low.

Added by Robert Larsen over 4 years ago

"Invalid SID" error is more or less normal and expected thing

Alright, I tried to increase the inactivity timeout:

bosh/max-inactivity = 30

However, then the BOSH port is never opened and the BOSH session manager is not found when I (admin rights) send a disco#items request to the server.

No exceptions in the log.

If I comment the max-inactivity setting out of the config, then BOSH is back.

Am I messing with the wrong setting?

Added by Wojciech Kapcia TigaseTeam over 4 years ago

It's

bosh/max-inactivity[L]=30

For the reference you can peruse etc/config-dump.properties

Added by Robert Larsen over 4 years ago

I have:

bosh/concurrent-requests[I] = 8
bosh/max-inactivity[L] = 30

...and also the sid log level set to 'FINEST', and this issue now seems to have been resolved.

The Strophe guys fixed the issue with not always respecting a 'invalid SID' error: https://github.com/strophe/strophejs/issues/76

...so now I see bad SIDs in the log once every five minutes or so which makes much more sense (before it was in the hundreds per second).

Load of course has dropped dramatically. It still jumps up and down between 0% and 90% but I don't think this is a problem.

Thanks for all the help

    (1-20/20)