Tigase Cluster Presence Issue
ALL of these tests are done using WebSocket connections.
Summary: We are having an issue where bulk-login users cannot see each other if they are on different servers.
Essentially, during a bulk login everyone on server 1 can see each other and everyone on server 2 can see each other,
but no one on server 1 can see anyone on server 2, and vice versa.
If you do the logins sequentially (waiting for user 1 to finish logging in before beginning user 2), everyone on both servers can see everyone.
I think the issue here is the new feature you added to make sure presence is not sent to an offline user. Because the users are all logging in at roughly the same time, when server 1 notifies server 2 of its online users, server 2 may not yet have realized that its own users are logging in, so it notifies no one, and vice versa. This appears to be a race condition: the users all authenticate at (relatively) the same time and therefore cannot see each other.
--------------------Ex. 1------------------------------
We had 10 users, all on server 1 (with logic to reconnect).
We turned on servers 2 and 3 (with no users on them).
Then we shut down server 1 so that everyone on it would reconnect randomly to server 2 or server 3.
Users 1, 5, 7, 8 ended up on server 2.
Users 3, 4, 6, 9, 10 ended up on server 3.
All users on server 3 (3, 4, 6, 9, 10) saw each other as available but saw NO one on server 2 (1, 5, 7, 8), and vice versa.
-------------------------------------------------------
++++++++++++++++++++Ex. 2++++++++++++++++++++++++++++++
We had 10 users distributed across servers 1 and 2, and all could see all online (with logic to reconnect).
We turned on server 3 and shut down servers 1 and 2 at the same time.
All 10 users ended up on server 3, and all 10 could see each other online.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
====================Ex. 3==============================
We had 10 users offline, with servers 1 and 2 active.
We logged in user 1. After user 1 was fully online we logged in user 2. After user 2 was fully online we logged in user 3 (and so on for all 10 users).
Users 1, 3, 5, 7, 9 ended up on server 1.
Users 2, 4, 6, 8, 10 ended up on server 2.
ALL 10 users could see all 10 users as online.
======================================================
**********************Ex. 4***************************
This one was slightly more complicated.
Beginning case: all 10 users were online on server 1 (5 with logic to reconnect, 5 without). I will signify these as A1-A5 and B1-B5, where group A will reconnect and group B will not.
We started servers 2 and 3 and, after they were online, we shut down server 1.
Server 2 ended up with: A1, A3, A5
Server 3 ended up with: A2, A4
A1 could see A3 and A5 online, and vice versa. A2 could see A4 online, and vice versa.
No one could see group B online (because they had not reconnected).
Continuation: 2 minutes after logging in all of group A, we logged in all of group B at the same time.
Server 2 ended with: A1, A3, A5, B2, B4
Server 3 ended with: A2, A4, B1, B3, B5
A1 could see: A3, A5, B1, B2, B3, B4, B5
A2 could see: A4, B1, B2, B3, B4, B5
B1 could see: A1, A2, A3, A4, A5, B3, B5
B2 could see: A1, A2, A3, A4, A5, B4
Group A can see everyone from group B (who logged in later) AND everyone from group A who was on the same server.
Group B can see everyone from group A (who logged in earlier) AND everyone from group B who was on the same server.
**********************************************************
From these experiments the only thing we can conclude is that clustering has trouble notifying presence when users log in to different servers at the same time. If they log in to the same server there is no problem, and if they log in to different servers with time in between there is no problem. Given the recent addition of logic to ensure that presence is not sent to offline users, I would assume this has something to do with it.
Added by Wojciech Kapcia about 5 years ago
John Catron wrote:
Due to the recent addition of logic to ensure that you do not send presence to offline users I would assume this has something to do with it.
While this is true (we added logic to optimize handling of presence for offline users), it is only enabled if clustering is turned off or you use ACS. With default clustering, which you are using (according to init.properties), the mechanism should be off (you can check the logs for the entry "Skipping sending presence to offline contacts enabled"). In addition, the feature can be turned off completely using the following settings:
Could you verify that the feature is turned off and/or try to reproduce the issue with the above settings, to rule out that it's related to this mechanism?
Added by John Catron about 5 years ago
You were correct in your assumption that the skip-presence-to-offline code had nothing to do with this. My apologies; it just seemed the only logical conclusion given the data I had at the time.
Since then, however, we have figured out what the issue is, and we have set up a hack to work around it until you are able to update Tigase so that it does not happen.
The issue is a race condition with clustering. After the Tigase service starts, it takes several seconds (up to a minute) to establish clustering with the other servers. Any users who connect after the Tigase service is running but before clustering is fully established will have this presence issue.
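The race we are describing can be sketched with a toy model (this is purely illustrative pseudocode of our understanding, not Tigase's actual internals; all names are made up): a node broadcasts a user's initial presence only to the cluster peers it is connected to at that instant, so peers that link up later never learn about that user.

```python
# Toy model of the clustering race: initial presence reaches only the
# cluster links that already exist at login time.

class Node:
    def __init__(self, name):
        self.name = name
        self.peers = []        # cluster links established so far
        self.online = set()    # users logged in locally
        self.seen = set()      # remote users this node has been told about

    def connect(self, other):
        # The cluster link being established -- in our setup this step
        # takes roughly 30-60 seconds after the service starts.
        self.peers.append(other)
        other.peers.append(self)

    def login(self, user):
        self.online.add(user)
        # Presence is broadcast only to peers connected *right now*.
        for peer in self.peers:
            peer.seen.add(user)

s1, s2 = Node("server1"), Node("server2")

s1.login("alice")     # logs in before the cluster link exists
s1.connect(s2)        # clustering established afterwards
s2.login("bob")       # logs in after the link: presence reaches server1

print("alice" in s2.seen)  # False: server2 never learned about alice
print("bob" in s1.seen)    # True: server1 learned about bob
```

This matches what we observed: users who authenticate before the link is up are invisible to the other node, while later logins propagate normally.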
We currently have set up two workarounds for this:
1) First, our client application now sends a second presence update roughly one minute after login, in case the server it logged into had just restarted; that way, by the time clustering is established, the user will have sent an updated presence.
2) Our second workaround is via iptables. When starting the service we block the client-facing ports (5290, 5291, 5222, etc.) with iptables until clustering is established, and only then open them, so that no clients can connect until the service is truly ready.
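The second workaround amounts to a small supervisor script along these lines (a sketch only: the log marker string, log path, and service name are placeholders for whatever your deployment actually uses, and the log-tailing is deliberately naive):

```python
import subprocess

CLIENT_PORTS = [5222, 5290, 5291]   # client-facing ports to gate
# Placeholder: whatever log line marks clustering as established
CLUSTER_READY_MARKER = "cluster connection established"
LOG_PATH = "/var/log/tigase/tigase.log"   # placeholder path

def iptables_rule(action, port):
    # action is "-A" to append the DROP rule, "-D" to delete it again
    return ["iptables", action, "INPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]

def cluster_ready(log_lines):
    """True once the marker line has appeared in the log."""
    return any(CLUSTER_READY_MARKER in line for line in log_lines)

def main():
    # 1) Shut clients out before the service starts listening.
    for port in CLIENT_PORTS:
        subprocess.run(iptables_rule("-A", port), check=True)
    # 2) Start the service (placeholder service name).
    subprocess.run(["systemctl", "start", "tigase"], check=True)
    # 3) Wait for the cluster-established marker (naive tail, no sleeps).
    seen = []
    with open(LOG_PATH) as log:
        while not cluster_ready(seen):
            seen.append(log.readline())
    # 4) Only now let clients in.
    for port in CLIENT_PORTS:
        subprocess.run(iptables_rule("-D", port), check=True)

if __name__ == "__main__":
    main()
```

The point of the design is ordering: clients physically cannot reach the node until the node can propagate their presence to its peers, which sidesteps the race entirely.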
While these workarounds seem to have solved our problems for now, it would be very nice if we didn't have to worry about these things. If there were any way for clustered servers to queue things such as presence until clustering is established, or to not begin listening on the client ports until clustering is established, it would help us out a lot, as well as any other Tigase users experiencing similar issues.
Added by Artur Hefczyc about 5 years ago
This is strange; there is really nothing to "establish" for the default clustering. The only establishment is connecting all cluster nodes together. If there is some problem at the network level which prevents or slows down connecting the cluster nodes together, then this indeed may cause problems. Wojciech will look at it and will ask you for more details.
Added by John Catron about 5 years ago
I attached some logs showing the establishment we are talking about. It takes roughly 30 seconds to a minute from the time the server starts listening on the client ports until those messages appear in the log (meaning that presence will work). Until the log shows those messages, anyone who connects will have issues with presence. After those log messages appear, presence is fine.
For now we are ignoring the warning about time on the server; we think this happens when a new node comes online and tries to connect to a node listed in the clustering DB that isn't there (since we were starting two servers at once).
These logs show the step we called 'establishing' the clustering.
Added by Artur Hefczyc about 5 years ago
Yes, this is strange. I cannot think of anything which may cause such a problem. It definitely requires some closer examination.
I am not sure if I have already mentioned this to you, but all the cluster nodes must be in sync with the correct time. If the time difference is too big they will never connect to each other as cluster nodes; if the time difference is smaller it may still cause some problems. I cannot say whether this is what happens in your case, however.
Added by Wojciech Kapcia almost 5 years ago
Artur Hefczyc wrote:
Wojtek, when we talked about this last time you mentioned that you had committed some fixes for this already, but the problem appears to still be affecting some users. So this might have more than one cause.
There hasn't been any commit regarding this issue yet, only a discussion about possible causes, but it turned out they were wrong.