Presence probe process may impact Tigase performance

Matthew M
Added about 5 years ago

We ran a small performance test. We created some users, each with 500 contacts. We found that when several users log in concurrently, Tigase slows down considerably and some presence stanzas are lost (some users never get online presence from some contacts).

After profiling and tracing the Tigase Java process, we found it is busy handling presence probes. Upon each user's login, Tigase generates presence probe packets for ALL contacts, regardless of whether the contacts are online. It then calls canHandle() to see whether each presence probe stanza can be delivered.

The most time-consuming part is canHandle() / checkPacket(), and the system becomes very busy. During this time, logins from other regular users become very slow.

I wonder whether this process could be improved: can the server skip generating probe packets when it already knows the contacts are offline? To simplify things, we can assume that in our system all contacts are in the same domain (that is, hosted by the same Tigase server). In other words, is it possible to tell quickly whether a contact is online?

In the worst case, what would be the consequence of disabling all outgoing presence probes? From the XMPP documentation, they do not seem to be that useful, or have I missed something?

Thanks!


Replies (6)

Added by Artur Hefczyc TigaseTeam about 5 years ago

500 contacts in the roster is a bit extreme, especially if you connect many users at the same time. I know this generates high traffic and high load. Could you please share some more details about your test?

  1. New user connection/disconnection rate

  2. Do you have any custom logic in any plugin, the Presence plugin especially?

  3. What kind of HW are you running Tigase on? Is it real HW or a VM? Number of CPUs/CPU cores, memory size?

  4. GC settings

  5. What kind of software do you use for user simulation?

  6. Do you have any stats from the Tigase server? I am particularly interested in plugin stats: queue sizes, average processing time

The reasons for sending presence probes to all contacts, even local users who in theory are known to be offline, are:

  1. The same logic applies to all users, so the implementation is simpler, more reliable and better tested

  2. For security reasons, users' data are kept well separated from each other, so within a plugin context the API has access only to the data of the user for whom the packet is being processed. This prevents an attacker from exploiting bugs or other holes in the implementation to load other users' data. I know it does not give you a 100% guarantee, but it still increases security

  3. In a clustered environment the contact may not be online on one cluster node but can be connected to a different cluster node. Depending on the configuration, this information may or may not be available on all nodes.

Presence probes are essential from the XMPP spec point of view and cannot be disabled without breaking the spec. They are really useful for getting the presence of a contact who is online while the roster owner is just connecting. Please keep in mind that we are talking about all the possible subscription states, which are all taken care of. Please note that the presence probe traffic problem exists only at user login time and only when a large number of users connect at the same time. However, with 500 contacts in the roster, if people change their status every 5 or 10 minutes you would quickly find that the probes are the minor problem and the real problem is the user status presence traffic. On some installations it gets well over 500k packets per second.

This does not mean you have to give up hope. There are ways to improve presence probes or presence traffic in general. We have done this before and could reduce unnecessary presence traffic to almost zero. However, there is no generic way to optimize for this, at least not one that does not break the XMPP spec. We usually analyze the traffic on the customer's system and the specific requirements of that system, which allows us to select the optimal method for reducing presence traffic and improving system throughput.

There is also an experimental API available in Tigase which could be used for generic presence optimization, but it is not yet well explored and tested, so it has not been put into production yet.

Added by Matthew M about 5 years ago

Many thanks for the reply and insights! I can explain a bit more about our use case.

I am trying to set up an IM server for internal collaboration within an organization. The only major customization I made is a dynamic roster for each member, so that they have their "contacts" immediately without adding them manually. In general, every member wants to talk to every other member of the same organization.

I agree 500 contacts is not small; however:

  1. It is very typical for an organization to have 500 members.

  2. All of them are local users (which is important for us to optimize, see below).

  3. Most of all, there are not many concurrent online users. At most 5% to 10% of users might be logged on at the same time. This is reasonable.

From a system point of view, it is very hard for us to explain the fact that the system's load is proportional to the "total offline users". Intuitively, for any system, offline users should not cause much load.

I understand XMPP needs presence probes to work, but I guess it is up to the implementation to avoid excessive CPU load caused by offline users.

I also agree with your points on simplification and security, but I do have a couple of questions.

  1. Would it be even simpler to skip the presence probe if I clearly know the contacts are local and offline? Since Tigase maintains the local users, there should be some simple and safe method to tell that a "contact is local and offline" without leaking any private information.

  2. Regarding security and DoS, it seems that a user can slow down the system by adding a lot of friends and then simply logging on and off, creating a lot of internal presence probe traffic even though no one else is online. Should this be considered too?

  3. What if I completely disable probes in the server? What will happen? I understand probes are useful ONLY for remote users, but if I assume all JIDs are local, can we safely turn them off? If not, what exactly would happen?

However, with 500 contacts in the roster if people change their status every 5 or 10 minutes 
you would quickly find that the probes are the minor problem 
and the real problem is the user status presence traffic. 
On some installations it gets well over 500k per second.

I think this is indeed an important problem. What would be your suggestion? It would be great if you could share some insights with us. We are able to control (most of) our clients to help the optimization if needed, but server-side optimization would be super.

For your questions:

  1. New user connection/disconnection rate

A one-time login: 20 to 50 users log on at the same time, each with 500 contacts

  2. Do you have any custom logic in any plugin, the Presence plugin especially?

No, only a dynamic roster to generate roster items

  3. What kind of HW are you running Tigase on? Is it real HW or a VM? Number of CPUs/CPU cores, memory size?

VM, CentOS 6, 2 GB memory, 2 CPUs

JVM: -Xms100M -Xmx200M -XX:PermSize=32m -XX:MaxPermSize=256m -XX:MaxDirectMemorySize=128m

  4. GC settings

No special or customized settings at all

  5. What kind of software do you use for user simulation?

We used the Tigase Java XMPP client library to write a very simple client and log in from the command line. No other interactions.

  6. Do you have any stats from the Tigase server? I am particularly interested in plugin stats: queue sizes, average processing time

Great point. How do we get the server stats? We would love to view and report them too.

After all, I think the main issue might be the logic of generating probe stanzas even when the contacts are well known to be offline. I guess the system should avoid any load related to offline (local) users, to make it scalable and robust against DoS attacks (otherwise, a single user with a simple log-on operation can drive up CPU).

Thanks again for your discussion!

Added by Artur Hefczyc TigaseTeam about 5 years ago

Let me address all your points one by one....

Matthew M wrote:

Many thanks for the reply and insights! I can explain a bit more about our use case.

I am trying to set up an IM server for internal collaboration within an organization. The only major customization I made is a dynamic roster for each member, so that they have their "contacts" immediately without adding them manually. In general, every member wants to talk to every other member of the same organization.

I agree 500 contacts is not small; however:

  1. It is very typical for an organization to have 500 members.

  2. All of them are local users (which is important for us to optimize, see below).

  3. Most of all, there are not many concurrent online users. At most 5% to 10% of users might be logged on at the same time. This is reasonable.

I did not realize this system is meant for at most roughly 500 users online. Now I understand the use case better. My remark about 500k packets per second of traffic was for systems with over 1 million online users in cluster mode.

In reality you expect up to 50 users online at any point in time. For such an installation we really should not need any special code customization. We should instead focus on configuring the system properly.

From a system point of view, it is very hard for us to explain the fact that the system's load is proportional to the "total offline users". Intuitively, for any system, offline users should not cause much load.

I agree. However, here, offline users do not stir up any load. The load is generated by the users who connect to the system. The XMPP spec requires the server to send a presence probe and initial presence to all contacts.

In the case of users located on other servers, you do not know who is online and who is offline until you "probe" for their presence status. Tigase treats all users as remote; there is no special treatment for local users.

Of course, special treatment for local users could give a huge performance improvement, and this is something we are looking at.

  1. Would it be even simpler to skip the presence probe if I clearly know the contacts are local and offline? Since Tigase maintains the local users, there should be some simple and safe method to tell that a "contact is local and offline" without leaking any private information.

Not only could/should the probe be skipped, but the initial presence could/should be skipped as well. As I said, there is an experimental API we are working on which is supposed to expose some limited user data in a safe way.

  2. Regarding security and DoS, it seems that a user can slow down the system by adding a lot of friends and then simply logging on and off, creating a lot of internal presence probe traffic even though no one else is online. Should this be considered too?

A few users with even very large rosters do not impact the system in a significant way. What matters is the average roster size and the login/logout rate, that is, how many users per second log in to the system, request the roster and exchange initial presence probes and status.

Please note that preventing the excessive load for local users only does not remove the DoS risk, as malicious users can simply add a huge number of remote accounts to their roster and perform login/logout. This would cause the same or even higher load than if there were only local users.

  3. What if I completely disable probes in the server? What will happen? I understand probes are useful ONLY for remote users, but if I assume all JIDs are local, can we safely turn them off? If not, what exactly would happen?

The probe is useful for ALL users, not just remote ones. The presence workflow is like this:

  1. User logs in

  2. User sends his initial presence to the server

  3. The server sends user's initial presence to all contacts

  4. The server sends user's probe presence to all contacts

  5. Contacts that are online respond to the probe with their own presence (technically the contact's server responds on behalf of the contact; the probe is never sent to the end client)
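
The five steps above can be pictured with a small simulation. This is only a sketch of the message flow, not Tigase code: `ToyServer`, its `login` method and the roster layout are made up for illustration, and it assumes a single domain where every roster entry has a mutual ("both") subscription.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical single-domain server that only records which packets flow."""
    rosters: dict[str, list[str]]
    online: set[str] = field(default_factory=set)
    # Each log entry is (kind, sender, recipient).
    log: list[tuple[str, str, str]] = field(default_factory=list)

    def login(self, user: str) -> list[str]:
        """Run steps 1-5 for `user`; return the contacts whose presence came back."""
        self.online.add(user)                             # steps 1-2: login + initial presence
        answered = []
        for contact in self.rosters.get(user, []):
            self.log.append(("presence", user, contact))  # step 3: initial presence fan-out
            self.log.append(("probe", user, contact))     # step 4: probe fan-out
            if contact in self.online:                    # step 5: server answers for contact
                self.log.append(("presence", contact, user))
                answered.append(contact)
        return answered


server = ToyServer(rosters={"alice": ["bob", "carol"], "bob": ["alice"]})
print(server.login("bob"))    # alice is offline, so nothing comes back: []
print(server.login("alice"))  # bob is online and answers: ['bob']
```

In this model an offline contact still costs two packets per login, which is the behaviour discussed in this thread.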

For your questions:

  1. New user connection/disconnection rate

A one-time login: 20 to 50 users log on at the same time, each with 500 contacts

By the connection (login) rate I meant how many new users per second connect to the server. This is what causes the load. When users are already logged in and do not send any data (presence or messages), they do not generate any traffic or any load on the server. The biggest impact on the load comes at user login time (authentication, roster retrieval, presence exchange, vCard, etc.)

Let's assume you connect 50 users in the exact same second. Each user has 500 contacts, of which 10% (50) are online. The XMPP traffic generated by this is approximately:

  1. User login (session, resource, all the initial handshaking): about 50 * 10 = 500 packets

  2. Roster retrieval: 50 * 2 = 100 packets

  3. VCard: 50 * 2 = 100 packets

  4. User's presence to the server generates: 50 * 1 initial presence client to server + 50 * 500 presence probes server to server + 50 * 500 initial presence server to client + 50 * 50 responses to probes from the 10% online contacts = 50 + 25,000 + 25,000 + 2,500 = 52,550 packets

Total traffic generated in 1 second by 50 users connecting during that second is about 53,250 packets, of which 52,550 is presence traffic.

Of course, once these users have logged in the traffic is close to none.
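
The estimate above is easy to recompute for other roster sizes or login bursts. The helper below is a back-of-the-envelope sketch: the function name `login_burst_packets` is made up, and the per-user constants (10 handshake packets, 2 roster packets, 2 vCard packets) are simply the approximations used in this post.

```python
def login_burst_packets(users: int, roster_size: int, online_contacts: int) -> dict[str, int]:
    """Approximate packet counts when `users` log in within the same second."""
    presence = (
        users * 1                  # initial presence, client to server
        + users * roster_size      # presence probes to every contact
        + users * roster_size      # initial presence fanned out to every contact
        + users * online_contacts  # probe responses from online contacts
    )
    return {
        "login": users * 10,       # session, resource, initial handshaking
        "roster": users * 2,       # roster retrieval
        "vcard": users * 2,        # vCard retrieval
        "presence": presence,
    }


burst = login_burst_packets(users=50, roster_size=500, online_contacts=50)
print(burst["presence"])     # 52550
print(sum(burst.values()))   # 53250
```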

Now the question is what we gain if we optimize for local users and send presences only to online users. We send probes and initial presences only to online contacts, which is 50 in our case:

50 * 1 initial presence client to server + 50 * 50 presence probes server to server + 50 * 50 initial presence server to client + 50 * 50 responses to probes from the 10% online contacts = 50 + 2,500 + 2,500 + 2,500 = 7,550 packets

Still a lot of traffic within one second, but a huge saving.
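
To see how the saving depends on the share of online contacts, the two presence formulas can be folded into one function. This is a sketch of the arithmetic only; `presence_packets` and its `online_only` switch are illustrative names, not Tigase settings.

```python
def presence_packets(users: int, roster_size: int, online_contacts: int,
                     online_only: bool) -> int:
    """Presence packets for a login burst, with or without the local-user optimization."""
    # Without the optimization, probes and initial presence go to the whole roster.
    targets = online_contacts if online_only else roster_size
    # Per user: 1 initial presence + probes + fanned-out presence + probe responses.
    return users * (1 + 2 * targets + online_contacts)


baseline = presence_packets(50, 500, 50, online_only=False)
optimized = presence_packets(50, 500, 50, online_only=True)
saving = 1 - optimized / baseline

print(baseline, optimized)   # 52550 7550
print(f"{saving:.0%}")       # 86%
```

With a roster of 20 contacts and the same 10% online share, the same function shows a much smaller absolute gain, which matches the remark that the benefit depends heavily on roster size.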

Please note, this is just for your use case, which is kind of special. For all the other systems we have dealt with until now, the benefit is not so huge. First of all, the user's roster is not that big. Most systems have an average contact list size of about 20, some 50.

Additionally, many contacts are on remote servers, for which we cannot use the optimization. Therefore, until now, while we were aware that such an optimization is possible, it was not our priority.

It seems that in your case we/you should focus on surviving the user login peak. If you really expect in production that 50 users connect at the same time, and if that happens very often, you should have a system which can handle that. There are 2 options:

  1. We could work together somehow on the software optimization

  2. You could adjust your HW configuration to handle that load

  2. Do you have any custom logic in any plugin, the Presence plugin especially?

No, only a dynamic roster to generate roster items

  3. What kind of HW are you running Tigase on? Is it real HW or a VM? Number of CPUs/CPU cores, memory size?

VM, CentOS 6, 2 GB memory, 2 CPUs

JVM: -Xms100M -Xmx200M -XX:PermSize=32m -XX:MaxPermSize=256m -XX:MaxDirectMemorySize=128m

  1. Is it real HW or a VM?

  2. Do you have any other software running on the system, or is it just for Tigase?

  3. Is it possible to increase the maximum (-Xmx) memory for Tigase? While 200M seems to be enough for normal usage, more memory would give Tigase extra room for larger queues during traffic spikes (such as 50 logins at the same time), so that packets which could be processed over time are not dropped.

  4. GC settings

No special or customized settings at all

  5. What kind of software do you use for user simulation?

We used the Tigase Java XMPP client library to write a very simple client and log in from the command line. No other interactions.

  6. Do you have any stats from the Tigase server? I am particularly interested in plugin stats: queue sizes, average processing time

Great point. How do we get the server stats? We would love to view and report them too.

There are 2 ways to get detailed stats from Tigase:

  1. Via XMPP - you can connect with the Psi client, for example, browse service discovery to the stats component, and double-click on the component to get a list of essential metrics. You can adjust the detail level to get more stats

  2. JMX - you can connect using JConsole, Tigase Monitor or the attached simple command-line tool to get all the server stats

After all, I think the main issue might be the logic of generating probe stanzas even when the contacts are well known to be offline. I guess the system should avoid any load related to offline (local) users, to make it scalable and robust against DoS attacks (otherwise, a single user with a simple log-on operation can drive up CPU).

While I agree the system can be optimized, and we are working on this, it really does not remove the DoS attack danger, as described above. Tigase really is scalable: it works well for systems with over 1 million online users, but your use case is not typical.

We will try to introduce the optimization we are talking about here. However, the presence workflow is not as simple as it looks, and the new logic needs some time before it is mature and well tested against all border cases. We used to have a similar optimization a while ago, which was pulled from the main code branch due to some spec issue.

Added by Artur Hefczyc TigaseTeam about 5 years ago

I am adding this as a separate comment so that the option does not get lost in the rather long response above. You can try the following configuration option in your init.properties file:

--skip-offline=true

This may improve things slightly, but not by much: this setting does not optimize traffic at login time but rather the normal traffic after users are logged in and the initial presence/probe traffic has been exchanged. Therefore, in your use case it should not really matter.

Added by Matthew M about 5 years ago

Many thanks for your comprehensive explanation, and for the tip about the "--skip-offline" option!

I think most of our concerns are well addressed by your answers.

Some more details about our server environment:

  1. It's a VM (using Xen)

  2. There is no other software

  3. Yes, we should be able to increase the memory (we thought 50 users was a lightweight load before we learned the details of the probe stage)

I still have one more question about login process and probe:

Contacts that are online respond to the probe with their own presence (technically the contact's server responds on behalf of the contact; the probe is never sent to the end client)

If I understand correctly, if ALL JIDs are local, then we might be able to skip the entire probe process, for both online and offline users, because:

  1. Probe is generated by user's server on behalf of the user

  2. technically contact's server responds to the probe on behalf of the contact

  3. in our particular case, user's server and contact's server are the same server, and basically the same JVM instance.

I know that by the XMPP spec such probes should go through the standard stanzas. But to me, if we know everything is within the same JVM, it seems there should be a very simple way to skip the probe for the entire roster completely and simply let the server send presence responses back to the user on behalf of online contacts. I know it sounds a bit hack-ish... but do I break anything here? Assume there are no remote JIDs (or that we can process them differently).

After all, I agree with you: 50 users logging on in the exact same second is a rare event, and the ongoing presence stanza exchange after login is a much bigger issue to deal with. But I just want to fully understand the probe process and the consequences of hacking it... :)

Thanks again for your help!

Added by Artur Hefczyc TigaseTeam about 5 years ago

I still have one more question about login process and probe:

Contacts that are online respond to the probe with their own presence (technically the contact's server responds on behalf of the contact; the probe is never sent to the end client)

If I understand correctly, if ALL JIDs are local, then we might be able to skip the entire probe process, for both online and offline users, because:

  1. Probe is generated by user's server on behalf of the user

  2. technically contact's server responds to the probe on behalf of the contact

  3. in our particular case, user's server and contact's server are the same server, and basically the same JVM instance.

I am not sure if you noticed from my explanation above that the server sends 2 presence stanzas to each contact on the user's behalf:

  1. User's initial presence - meaning of this is: Hey, this is my online status presence

  2. Presence probe - meaning of this is: Hey, I am interested in your presence as well, please send me your online status presence back
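
On the wire, the two stanzas differ only in the `type` attribute. The sketch below builds both with Python's standard `xml.etree.ElementTree`. The JIDs are placeholders and the helper name `login_stanzas` is made up; real stanzas carry more detail (stream namespace, `id`, the addressing rules from RFC 6121) than is shown here.

```python
import xml.etree.ElementTree as ET


def login_stanzas(user_jid: str, contact_jid: str) -> tuple[str, str]:
    """Return (initial presence, presence probe) addressed to one contact."""
    # Initial presence: a <presence/> without a 'type' attribute signals availability.
    initial = ET.Element("presence", {"from": user_jid, "to": contact_jid})
    # Probe: same element, but type='probe' asks for the contact's current presence.
    probe = ET.Element(
        "presence", {"from": user_jid, "to": contact_jid, "type": "probe"}
    )
    return (
        ET.tostring(initial, encoding="unicode"),
        ET.tostring(probe, encoding="unicode"),
    )


initial_xml, probe_xml = login_stanzas("alice@example.com/desk", "bob@example.com")
print(initial_xml)
print(probe_xml)
```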

Now, the contact's server interprets both presence packets in the following way:

  1. User's initial presence - oh, this is the user's online status presence. Is the contact really interested in this user's presence (checks subscription state, from or both)? If yes, the server then checks privacy lists and other blocking mechanisms. If nothing forbids sending the presence, the server forwards it to the contact's client; if not, it drops the presence

  2. User's probe - oh, that user is interested in the contact's online status presence. Is he allowed to see the contact's presence (checks subscription state, to or both)? If yes, the server then checks privacy lists and other blocking mechanisms, invisibility state, etc. If nothing forbids it, the server sends the contact's online status presence back to the user's server; if not, it may respond with an error or just drop the presence
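
The two checks can be written down as a pair of decision functions. This is a simplified sketch and not the Tigase Presence plugin: the function names and return labels are invented, the subscription values are used exactly as this post names them (which roster they are read from is left open here), and privacy lists are reduced to a single `blocked` flag.

```python
def handle_initial_presence(subscription: str, blocked: bool) -> str:
    """Decide what the contact's server does with an incoming initial presence."""
    # Subscription states as named in the post: 'from' or 'both' allow delivery.
    if subscription in ("from", "both") and not blocked:
        return "forward-to-client"
    return "drop"


def handle_probe(subscription: str, blocked: bool, invisible: bool) -> str:
    """Decide what the contact's server does with an incoming presence probe."""
    # Subscription states as named in the post: 'to' or 'both' allow an answer.
    if subscription in ("to", "both") and not blocked and not invisible:
        return "reply-with-presence"
    # The server may answer with an error or silently drop the stanza.
    return "drop-or-error"


print(handle_initial_presence("both", blocked=False))           # forward-to-client
print(handle_probe("both", blocked=False, invisible=True))      # drop-or-error
```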

Now, in your particular case you always have the "both" subscription state, and your users probably do not use privacy lists, invisibility or anything like that. So, in your case the above workflow could be simplified to something like this:

  1. User's server sends user's online status presence to the contact

  2. Contact's server treats this online status presence as both

    1. the status presence
    2. the probe presence

and performs the above actions (forwarding the user's presence to the contact's client and responding with the contact's presence).

The above workflow is possible. Just remember that the user's status presence will be exchanged many times, but you treat it as a probe only the first time it is received from an online user.

Also, in this case the user's server and the contact's server are the same entity, so it is possible to improve the system and lower traffic (load) by skipping sending presences to offline users.
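
A minimal sketch of that simplified workflow, under the assumptions stated above (all JIDs local to one server, every subscription "both", no privacy lists or invisibility). The class `SimplifiedPresence` and its methods are hypothetical and do not correspond to any Tigase API.

```python
class SimplifiedPresence:
    """Toy single-server presence handler that treats the first status as a probe."""

    def __init__(self, rosters: dict[str, list[str]]):
        self.rosters = rosters
        self.online: dict[str, str] = {}   # user -> last known status

    def on_status(self, user: str, status: str) -> list[tuple[str, str, str]]:
        """Return the (sender, recipient, status) packets produced by one status update."""
        first_time = user not in self.online   # only the first status acts as a probe
        self.online[user] = status
        packets = []
        for contact in self.rosters.get(user, []):
            if contact not in self.online:
                continue                        # skip offline local contacts entirely
            packets.append((user, contact, status))
            if first_time:
                # Implicit probe answer: send the contact's status back to the user.
                packets.append((contact, user, self.online[contact]))
        return packets

    def on_logout(self, user: str) -> None:
        self.online.pop(user, None)


srv = SimplifiedPresence({"alice": ["bob", "carol"], "bob": ["alice"]})
print(srv.on_status("bob", "available"))    # alice is offline: []
print(srv.on_status("alice", "available"))  # forwarded to bob, bob's status returned
print(srv.on_status("alice", "away"))       # later updates are forwarded only
```

The per-user "first time" flag has to be reset on logout, otherwise a returning user would never receive the presence of contacts who are already online.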

However, if you ever plan in the future to open your installation to any external XMPP/Jabber servers (jabber.org, gtalk, etc...) your modifications may cause problems with exchanging users' status between systems.
