
Problem with Stream Management

Luca Stucchi
Added about 4 years ago

Hi there,

I am trying to make stream management work, testing this scenario:

  • a client on a mobile phone, connected with stream management on

  • Tigase server 7.0.0 with stream management on

  • the client is connected to the server

  • I put the phone in airplane mode

  • I send a message to the client

In the log I can see that the message (together with presence/IQ/...) arrives at the server, and the resumption timeout check kicks in...

2015-02-25 11:32:16.446 [socketReadThread-8]  StreamManagementIOProcessor.serviceStopped()  FINEST: c2s@acme.domain.net/172.31.30.109_5222_10.0.0.1_49322, type: accept, Socket: TLS: c2s@acme.domain.net/172.31.30.109_5222_10.0.0.1_49322 Socket[unconnected], jid: 103@acme.domain.net/MyResource, service stopped - checking resumption timeout

After a while it looks like the stream gets closed (I see many messages saying so):

2015-02-25 11:32:51.313 [scheduler_pool-6-thread-1-c2s]  ClientConnectionManager.xmppStreamClosed()  FINE: Service stopped, sending packet: from=c2s@acme.domain.net/172.31.30.109_5222_10.0.0.1_49322, to=sess-man@acme.domain.net, DATA=<iq type="set" from="c2s@acme.domain.net/172.31.30.109_5222_10.0.0.1_49322" to="sess-man@acme.domain.net" id="c4502e14-0454-4f0b-9144-f0327b94b5b0"><command node="STREAM_CLOSED" xmlns="http://jabber.org/protocol/commands"><x type="submit" xmlns="jabber:x:data"><field var="user-jid"><value>103@acme.domain.net/MyResource</value></field></x></command></iq>, SIZE=375, XMLNS=null, PRIORITY=SYSTEM, PERMISSION=NONE, TYPE=set

2015-02-25 11:32:51.313 [in_3-message-router]  MessageRouter.processPacket()  FINEST:   Processing packet: from=103@acme.domain.net/MyResource, to=acme.domain.net, DATA=<iq type="error" from="103@acme.domain.net/MyResource" to="acme.domain.net" id="tigase-ping"><ping xmlns="urn:xmpp:ping"/><error code="404" type="wait"><recipient-unavailable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/></error></iq>, SIZE=244, XMLNS=null, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=error

2015-02-25 11:32:51.313 [in_5-message-router]  MessageRouter.processPacket()  FINEST:   Processing packet: from=c2s@acme.domain.net/172.31.30.109_5222_10.0.0.1_49322, to=sess-man@acme.domain.net, DATA=<iq type="set" from="c2s@acme.domain.net/172.31.30.109_5222_10.0.0.1_49322" to="sess-man@acme.domain.net" id="c4502e14-0454-4f0b-9144-f0327b94b5b0"><command node="STREAM_CLOSED" xmlns="http://jabber.org/protocol/commands"><x type="submit" xmlns="jabber:x:data"><field var="user-jid"><value>103@acme.domain.net/MyResource</value></field></x></command></iq>, SIZE=375, XMLNS=null, PRIORITY=SYSTEM, PERMISSION=NONE, TYPE=set


2015-02-25 11:32:51.314 [session-close Queue Worker 3]  SessionManager$SessionCloseProc.process()  FINEST: Executing connection close for: from=c2s@acme.domain.net/172.31.30.109_5222_10.0.0.1_49322, to=sess-man@other.domain.net, DATA=<iq type="set" from="c2s@other.domain.net/172.31.30.109_5222_10.0.0.1_49322" to="sess-man@other.domain.net" id="c4502e14-0454-4f0b-9144-f0327b94b5b0"><command node="STREAM_CLOSED" xmlns="http://jabber.org/protocol/commands"><x type="submit" xmlns="jabber:x:data"><field var="user-jid"><value>103@acme.domain.net/MyResource</value></field></x></command></iq>, SIZE=375, XMLNS=null, PRIORITY=SYSTEM, PERMISSION=NONE, TYPE=set
2015-02-25 11:32:51.315 [session-close Queue Worker 3]  SessionManager.closeSession()  FINE: Closing connection for: 103@acme.domain.net/MyResource
2015-02-25 11:32:51.315 [session-close Queue Worker 3]  SessionManager.closeSession()  FINE: Found parent session for: 103@acme.domain.net/MyResource

But at a certain point something weird happens:

2015-02-25 11:32:51.315 [in_7-sess-man]    SessionManager.processPacket()     FINEST:   processing packet: from=103@acme.domain.net/MyResource, to=sess-man@other.domain.net, DATA=<iq type="error" from="103@acme.domain.net/MyResource" to="acme.domain.net" id="tigase-ping"><ping xmlns="urn:xmpp:ping"/><error code="404" type="wait"><recipient-unavailable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/></error></iq>, SIZE=244, XMLNS=null, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=error, connection: XMPPResourceConnection=[user_jid=sess-man@other.domain.net, packets=18, connectioId=null, domain=other.domain.net, authState=NOT_AUTHORIZED, isAnon=false, isTmp=false]

2015-02-25 11:32:51.315 [in_7-sess-man]    SessionManager.walk()              FINEST:   XMPPProcessorIfc: UrnXmppPing (urn:xmpp:ping)Request: from=103@acme.domain.net/MyResource, to=sess-man@other.domain.net, DATA=<iq type="error" from="103@acme.domain.net/MyResource" to="acme.domain.net" id="tigase-ping"><ping xmlns="urn:xmpp:ping"/><error code="404" type="wait"><recipient-unavailable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/></error></iq>, SIZE=244, XMLNS=null, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=error, conn: XMPPResourceConnection=[user_jid=sess-man@other.domain.net, packets=19, connectioId=null, domain=other.domain.net, authState=NOT_AUTHORIZED, isAnon=false, isTmp=false]
2015-02-25 11:32:51.316 [in_7-sess-man]    SessionManager.processPacket()     INFO:     Impossible happened, please report to developer packet: from=103@acme.domain.net/MyResource, to=sess-man@other.domain.net, DATA=<iq type="error" from="103@acme.domain.net/MyResource" to="acme.domain.net" id="tigase-ping"><ping xmlns="urn:xmpp:ping"/><error code="404" type="wait"><recipient-unavailable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/></error></iq>, SIZE=244, XMLNS=null, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=error, connection: XMPPResourceConnection=[user_jid=sess-man@other.domain.net, packets=19, connectioId=null, domain=other.domain.net, authState=NOT_AUTHORIZED, isAnon=false, isTmp=false].

Notice that the connection close is sent by another subdomain (other.domain.net). Is this OK/expected?

The IQ stanza with the STREAM_CLOSED command arrives at the session manager, but, surprise surprise, I get a message saying "Impossible happened"... (May I humbly request that you consider changing this message? After two hours of debugging you can expect everything, but "impossible happened" is a little too much to digest! :) )

Apart from the message, do you see anything wrong, and most of all, can you understand what is happening? I ask because, sadly, once resumption_timeout passes, messages are neither persisted to the database nor delivered if the client reconnects. Am I doing something wrong? I'd be glad to find out that I am, or even to find out that there is a problem I can help fix (as I always say to myself, that's why we all love open source!)

Thanks in advance,

Luca


Replies (11)

Added by Ganesh Krishnan about 4 years ago

I have the same issue. If the app doesn't send an unavailable status (for example, when put in airplane mode), the message is just lost because the server thinks the connection is still valid. It takes a few minutes(?) for the server to close the connection.

Once the server closes the connection, the messages are persisted in the database. I could not find a quick way to resolve this, so I implemented read receipts in the client and resend a message if the client has not read it (once I see the client's presence come back online). A rough sketch of that approach is below.
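
This is a library-agnostic sketch of that workaround; every class and method name in it is hypothetical and would have to be wired into whatever XMPP client library you use:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Keep every outgoing message until the recipient acknowledges it (e.g. via
// XEP-0184 delivery receipts), then replay the unacknowledged ones when the
// recipient's presence comes back online.
final class PendingMessageTracker {

    // Unacknowledged messages, keyed by stanza id.
    private final Map<String, String> pending = new ConcurrentHashMap<>();

    // Call right after sending a message.
    void onSent(String stanzaId, String body) {
        pending.put(stanzaId, body);
    }

    // Call when a delivery/read receipt for the given stanza id arrives.
    void onAcknowledged(String stanzaId) {
        pending.remove(stanzaId);
    }

    // Call when the recipient's presence switches back to online.
    void onRecipientOnline(MessageSender sender) {
        pending.forEach((id, body) -> sender.resend(id, body));
    }

    // Hypothetical hook into the client library's send path.
    interface MessageSender {
        void resend(String stanzaId, String body);
    }
}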

Added by Andrzej Wójcik (Tigase Team) about 4 years ago

I did an analysis of the parts of the logs you attached, and from what I can see:

  1. STREAM_CLOSED being sent by another subdomain is not something that should cause an issue.

  2. "Impossible happened" is reported because a packet reporting an error is sent from a closed connection whose session in SessionManager is already closed. In the code predating StreamManagement this could not happen (hence the message), but it is also not the cause of messages not being saved to the offline store.

I suppose that your message is not saved because you sent the message to the full JID of the recipient. Messages sent to a full JID are not stored in the offline store, because in some cases (with random resources) they would never be delivered to the recipient. So instead of storing a message sent to a full JID, we send it back as an error. I would suggest checking whether you received an error message on the client from which you sent the message.

Added by Luca Stucchi about 4 years ago

Hi there,

I performed some tests and found out that messages sent both to the bareJid and to the fullJid are stored, but only after a huge amount of time (15 minutes). Below you can find the settings in my init.properties:

# comments below are assumed meanings, based on the property names --
# check the Tigase documentation for the version you run
# enable XEP-0198 Stream Management on c2s connections
c2s/processors[s]=urn:xmpp:sm:3
# interval between watchdog checks of a connection, in ms
c2s/watchdog_delay[L]=60000
# inactivity after which the watchdog considers a connection broken, in ms
c2s/watchdog_timeout[L]=10000
# probe idle connections with an XMPP ping
c2s/watchdog_ping_type=xmpp

In the log I see three types of rows:

  • a ping (every minute) for 15 minutes

  • messages with something added to the original content (the 'to' is the unavailable recipient, so they will never be delivered)

  • messages with something else added to the original content (again, the 'to' is the unavailable recipient, so they will never be delivered)

When I get these messages (so after more than 15 minutes), a message is sent to amp@acme.domain.net that triggers the storage.

Now, I expected the storage to trigger in 60 seconds, not 15 times slower. You could answer that the operating system needs time to detect the disconnection, but we have already applied the indications we found on the forum.

Could you tell me how to obtain faster storage, or at least have the error messages sent earlier, so that we can at least be aware of the failure? Most of all, I'd like to solve the problem whereby if the client reconnects after 2 minutes the message is lost, while if it reconnects after 16 minutes the message is there.

Thanks in advance,

Luca

Added by Luca Stucchi about 4 years ago

Hi again,

investigating further, I found out that the 15-minute delay is caused by the time the operating system takes to detect a disconnection, and I got that value by following the Tigase indications at http://docs.tigase.org/tigase-server/5.2.3/admin_guide/html/#__etc_sysctl_conf

Now, I expected that after the stream management timeout the message would be persisted to the database. Instead it looks like the message is kept in memory (and the session can be resumed) within the timeout, but once the timeout has passed I have to rely on disconnection detection, not on stream management.

Maybe I am missing the point, but what I would like to implement is exactly that behavior: within 60 seconds the message is kept in RAM so that the session can be resumed; after that it is persisted to the database.

Do you see any way to obtain this? Even indirectly, for example with a custom component, or any other way that could fulfill the requirement...

Thanks in advance,

Luca

Added by Andrzej Wójcik (Tigase Team) about 4 years ago

Hi,

I'm not 100% sure what you are trying to do, but here is how Stream Management currently works in Tigase XMPP Server:

  • every packet is stored in a queue, waiting for confirmation from the client

  • when the TCP connection to the client is broken (i.e. the disconnection was detected), Stream Management starts a timeout during which it keeps the packets in memory

  • if the session is resumed, the packets are delivered as normal (the same as if the connection had never been broken)

  • if the session is not resumed, the packets are processed by ClientConnectionManager in the method processUndeliveredPacket, which decides what should be done with each packet that was not delivered to its recipient

The method processUndeliveredPacket in ClientConnectionManager uses the following logic to decide what to do with an undelivered packet (see the sketch after this list):

  • if the packet was of type error, it is ignored

  • if the packet contained a presence, it is also ignored

  • if the packet was a message, it is marked and sent back to SessionManager, where it is processed once again by every processor, but this time in the same way as a packet sent to a user who is offline (a message sent to a bare JID is stored in the offline message store, and a message sent to a full JID results in a recipient-unavailable error being sent back to the sender)

  • if the packet was of any other type (only iq should be left at this point), it is sent back to the sender with a recipient-unavailable error
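
In code, those rules amount to something like this (an illustrative sketch in plain Java, not the actual Tigase source):

// Restatement of the dispatch rules above; the real logic lives in
// ClientConnectionManager.processUndeliveredPacket.
final class UndeliveredPacketRules {

    enum Verdict { IGNORE, RESUBMIT_AS_IF_OFFLINE, BOUNCE_RECIPIENT_UNAVAILABLE }

    static Verdict decide(String elemName, boolean isErrorType) {
        if (isErrorType) {
            return Verdict.IGNORE;                 // error packets are dropped
        }
        switch (elemName) {
            case "presence":
                return Verdict.IGNORE;             // presences are dropped too
            case "message":
                // handed back to SessionManager as if the user were offline:
                // bare JID -> offline store, full JID -> recipient-unavailable
                return Verdict.RESUBMIT_AS_IF_OFFLINE;
            default:                               // essentially only iq is left
                return Verdict.BOUNCE_RECIPIENT_UNAVAILABLE;
        }
    }
}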

So to sum up: Stream Management keeps stanzas in memory (by default) for 60 s after the connection failure is detected; after that, messages sent to a full JID are sent back as recipient-unavailable errors and messages sent to a bare JID are stored offline.

It is possible to override the ClientConnectionManager.processUndeliveredPacket method to force it to send undelivered messages directly to the offline store, but that would still happen only after the connection failure is detected.
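
A minimal, untested sketch of such an override: the exact signature of processUndeliveredPacket differs between Tigase versions, so treat the one below, and the bare-JID readdressing trick, as assumptions to check against the sources of the version you run. You would also need to configure Tigase to load this subclass in place of the stock c2s component.

import tigase.server.Packet;
import tigase.server.xmppclient.ClientConnectionManager;
import tigase.xmpp.JID;

// Readdress every undelivered message to the recipient's bare JID before
// handing it back to the normal undelivered-packet handling, so that
// SessionManager takes the offline-store path instead of bouncing a
// recipient-unavailable error for full-JID messages.
public class OfflineFirstConnectionManager extends ClientConnectionManager {

    @Override // NOTE: signature assumed -- verify against your Tigase version
    protected void processUndeliveredPacket(Packet packet, String errorMessage) {
        if ("message".equals(packet.getElemName()) && packet.getStanzaTo() != null) {
            Packet readdressed = packet.copyElementOnly();
            readdressed.initVars(packet.getStanzaFrom(),
                                 JID.jidInstance(packet.getStanzaTo().getBareJID()));
            super.processUndeliveredPacket(readdressed, errorMessage);
        } else {
            super.processUndeliveredPacket(packet, errorMessage);
        }
    }
}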

Added by Luca Stucchi about 4 years ago

Hi Andrzej,

I had to meditate on this a little: you are saying that Tigase waits for X seconds after the disconnection is detected, and then saves the packets to our database.

So my attention needs to shift to the disconnection detection. Following your indications at http://docs.tigase.org/tigase-server/5.2.3/admin_guide/html/#__etc_sysctl_conf we set:

fs.file-max=1000000
net.ipv4.ip_local_port_range=1024 65000
# idle seconds before the first keepalive probe is sent
net.ipv4.tcp_keepalive_time=60
# unanswered probes before the connection is declared dead
net.ipv4.tcp_keepalive_probes=3
# seconds between successive probes
net.ipv4.tcp_keepalive_intvl=90

With these settings we observe a 15-minute delay (even though, following the math at http://www.gnugk.org/keepalive.html, it should be tcp_keepalive_time + tcp_keepalive_probes × tcp_keepalive_intvl = 60 + 3 × 90 = 330 s = 5 m 30 s), after which the storing process starts.

But my question is: to get a shorter disconnection detection time, would you recommend playing with these values, or would it be better not to? Is there anything we can do to lower the detection time?

This is because we want the message in persistent storage ASAP in a non-delivery scenario, but of course this kind of modification can't be made lightheartedly!

BTW, the default values are

net.ipv4.tcp_keepalive_time=7200 (i.e. 2 hours)
net.ipv4.tcp_keepalive_probes=9
net.ipv4.tcp_keepalive_intvl=75

which means that with the modifications I made, detection should drop from 7200 + 9 × 75 = 7875 s = 2 h 11 m 15 s to 5 m 30 s. Do you think it makes sense to go down this road and reduce those values further?

P.S. In your post at http://docs.tigase.org/tigase-server/5.2.3/admin_guide/html/#_tcp_keepalive there is a dead link to blog.kolargol.eu; the domain looks to have expired and to be up for sale... just to let you know!

Added by Andrzej Wójcik (Tigase Team) about 4 years ago

How do you simulate connection failure?

I suppose that if you do this by killing some test application on another host, it is possible that the OS on the computer hosting the client application "thinks" the connection is still valid and responds to TCP keepalive packets even though the app is already dead. This would explain why your settings changes reduced the connection-failure discovery time, but not to the value you expected. It would not happen if the connection failure were caused by, e.g., a hardware failure (an unplugged Ethernet cable, say).

I would not recommend reducing the keepalive values: on a busier network or in mobile environments (including mobile and Wi-Fi networks) this may lead to connection stability issues (TCP failures may become very frequent), which may have a bad influence on the whole system (client and server). However, other values may be appropriate in some cases.

Added by Luca Stucchi about 4 years ago

Hi Andrzej,

I simulate a connection failure by connecting from a mobile device and then putting it in airplane mode, so the server is not notified of anything by the device. This should replicate the network-failure use case quite well, and it is comparable to the unplugged cable.

The watchdog tries to ping the device every X seconds, as I configured it to, so that part is great, but far more than the expected 5+ minutes pass before a disconnection is detected (almost 15 minutes).

After that, for every single message I get a 404 recipient-unavailable error like the one below

<error type="wait" code="404"><recipient-unavailable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/></error>

after which the message is stored:

2015-03-13 11:54:05.151 [in_0-amp]         JDBCMsgRepository.storeMessage()   FINEST:

Does this seem consistent to you? My only aim is to have storeMessage() called earlier; any suggestions?

All the best,

Luca

Added by Julia Zashchitina almost 3 years ago

Hello everyone.

We had the same problem with losing messages on Wi-Fi disconnect, so we enabled Stream Management on the server (Tigase version 7.0.2) with the following properties:

# a leading '+' enables the given plugin, '-' disables it
--sm-plugins = +message-archive-xep-0136,-message-carbons,-amp,+msgoffline,+message

c2s/watchdog_delay[L]=60000
c2s/watchdog_timeout[L]=60000
c2s/max-inactivity-time[L]=60
c2s/processors[s]=urn:xmpp:sm:3

If we understood Andrzej Wójcik correctly, messages sent to a fullJid should not be stored in the offline database. However, there are 2 rows in the tig_pairs table and messages are delivered twice when the user comes back online:

1Offline Storage - mww-212.company.com

1Offline Storage - mww-212.company.com

Could you please suggest how we can fix this?

Added by Andrzej Wójcik (Tigase Team) almost 3 years ago

Thank you for reporting this issue; I will work on a solution. Right now I can see that it happens only when you use -amp,+message,+msgoffline instead of the default configuration, in which amp is the only enabled processor handling messages.

To fix this issue right now, I would suggest replacing the following part of the etc/init.properties file:

--sm-plugins = +message-archive-xep-0136,-message-carbons,-amp,+msgoffline,+message

with the following:

--sm-plugins = +message-archive-xep-0136,-message-carbons

however, this will enable the AMP feature.

I do not think this issue is related to StreamManagement.

I created task #4120 to find the cause of this issue and fix it.

Added by Julia Zashchitina almost 3 years ago

Thank you, Andrzej!
