Tigase High Performance Configurations

Kulshreshth Dhiman
Added over 4 years ago

Hi,

I want to configure Tigase server to handle large message traffic.

Currently I have added the following to my init.properties:

--new-connections-throttling = 5222:10000,5290:10000,5280:10000

--sm-threads-pool = custom:100

--max-queue-size = 10000

Kindly suggest other configuration options, such as the number of scheduler threads and session manager threads, and any other settings needed to meet this requirement.

Thanks and Regards

Kulshreshth Dhiman


Replies (16)

Added by Wojciech Kapcia TigaseTeam over 4 years ago

You could consider adjusting --cm-traffic-throttling and --cm-ht-traffic-throttling.
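
For illustration only, the corresponding init.properties entries would look roughly like the lines below. The numbers are just examples (the first value is the per-minute XMPP packet limit, 0 means no total limit, and disc disconnects the offending connection); the exact format and defaults should be verified against the properties guide:

--cm-traffic-throttling = xmpp:5000:0:disc,bin:20m:0:disc

--cm-ht-traffic-throttling = xmpp:50k:0:disc,bin:200m:0:disc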

As for the number of threads and the sizes of queues - those are adjusted automatically based on the capabilities of the machine and, as a rule of thumb, should be good enough for most deployments. Anything beyond that requires deeper knowledge and understanding of the particular deployment and the expected traffic.

Added by Artur Hefczyc TigaseTeam over 4 years ago

What do you mean by "large message traffic"? Could you give us some numbers: what kind of traffic in total for the whole installation, what traffic for a single user's connection, how many users?

Most of our settings, limits and throttling are designed to limit traffic per single user's connection. They do not affect the overall traffic the system can handle. So, if you expect large traffic for a single user, you should adjust the settings suggested by Wojciech; otherwise they will not affect you.

I strongly advise against playing with thread-pool and queue-size settings. The automatic values are usually optimal for most use cases and offer the best performance with protection from server overload.

Added by Kulshreshth Dhiman over 4 years ago

Hi,

I am trying to handle over 30K connections with around 30K text messages per day. For Android clients, the authentication process is taking too much time. The auth-success response from the server takes more than 20 seconds, and the same is true for the bind request. In order to speed up the auth and bind process, I have created thread pools for auth and bind using the following settings, along with a couple of other changes.

--new-connections-throttling = 5222:10000,5290:10000,5280:10000

--max-queue-size = 10000

--net-buff-high-throughput = 256k

--net-buff-standard = 32k

--sm-threads-pool=custom:100

--sm-plugins=session-close=6,session-open=6,jabber:iq:register=6,jabber:iq:auth=6,urn:ietf:params:xml:ns:xmpp-sasl=6,urn:ietf:params:xml:ns:xmpp-bind=6,urn:ietf:params:xml:ns:xmpp-session=6,message-all=1

--user-repo-pool-size = 20

--auth-repo-pool-size = 20

According to this document, http://docs.tigase.org/tigase-server/snapshot/properties_guide/html/#smThreadsPool, with custom thread-pool settings the server will synchronize packets in order to maintain packet order. Will there be a performance impact due to this synchronization?

Kindly suggest optimum configurations to solve this problem.

Thanks

Added by Artur Hefczyc TigaseTeam over 4 years ago

Kulshreshth, none of the settings you are trying will improve throughput on your installation. Really, the default settings are usually optimal and give you the best performance.

If the performance is not satisfactory, you should investigate where the bottleneck is and fix it. Possible areas to look at:

  1. Slow database

  2. High logging level (debugging switched on)

  3. Too low memory settings for Tigase (see the example after this list)

  4. Too small a server for the load, or an incorrect VM configuration

  5. Too high a load at peak time. I mean, when you reconnect 30k users at the same time, there will be delays, even several seconds, because the system needs time to process the high volume of traffic and DB requests. This is normal and expected. However, once the users are all connected, there should be no more delays. Subsequent users should be able to connect within 1 second or less.

  6. Make sure the system is not overloaded; this means you should actually reduce throttling settings or leave the default values. If you allow too much traffic on the server at once, the system may get overloaded and can become very slow.
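
On the memory point (3): in a standard Tigase install the heap is set through JAVA_OPTIONS, typically in etc/tigase.conf. A rough example for a dedicated machine follows; the exact values are only illustrative and have to be sized to your hardware and whatever else runs on the box:

JAVA_OPTIONS="${GC} ${EX} ${ENC} ${DRV} -server -Xms4G -Xmx8G -XX:MaxDirectMemorySize=256m"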

Added by Steffen Larsen over 4 years ago

Also remember the OS settings and limits: ulimit and the TCP/IP stack in the kernel. I can easily reach around 80-90k connections per server instance.
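
For illustration, the kind of OS-level settings meant here are the per-process file-descriptor limit and a few kernel TCP parameters; the values below are only examples and need to be tuned per deployment:

ulimit -n 350000

and in /etc/sysctl.conf (applied with sysctl -p):

fs.file-max = 1000000
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.ip_local_port_range = 1024 65535

Note that net.core.somaxconn also caps the listen backlog an application can request in bind().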

Added by Kulshreshth Dhiman over 4 years ago

Hi,

As you suggested, I reverted these settings to the default Tigase settings (removed them from init.properties). I then got the following error 8000 times in 60 minutes.

tigase.log.4:2014-11-08 19:18:12.130 [in_10-sess-man] ProcessingThreads.addItem() SEVERE: Packet dropped due to queue overflow: from=null, to=null, DATA=, SIZE=292, XMLNS=null, PRIORITY=SYSTEM, PERMISSION=NONE, TYPE=set

I am using the Tigase 5.2.2 released version. Do I need to increase the internal queue size or the number of session manager threads?

Kindly suggest the appropriate changes in the configurations.

Thanks and regards

Kulshreshth

Added by Artur Hefczyc TigaseTeam over 4 years ago

I am using the Tigase 5.2.2 released version. Do I need to increase the internal queue size or the number of session manager threads?

Increasing queue size will not help. It will just make your queues larger and eventually they will be overfilled as well. You need to investigate why your queues are being filled, in other words why your installation cannot process all the packets quickly enough. I have given you a few pointers above for what to look at.

If you provide us some more details about your system we may have some suggestions:

  1. What kind of HW do you use? Is it real HW or a VM? (Memory, CPU, etc.)

  2. What are memory settings for the Tigase process?

  3. What traffic do you have on the system?

    1. How many logins/logouts per minute?
    2. What is the average size of the contact list?
    3. How often does presence status change, on average, per user?
    4. How much message traffic do you have? (messages/minute)
    5. Do you record messages or any other data to the database?
    6. Do you use MUC or PubSub? If yes, what kind of traffic is there (number of rooms, occupants, messages posted to rooms)?
    7. Is any other traffic put on the server?
    8. Is debug log disabled on the server?
  4. What database do you use? Is it installed locally or on a different server? Is it optimized to cope with the traffic?

  5. I suggest collecting the server statistics and checking which queues are getting overfilled; see the sketch below.
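
For collecting those statistics, one option (assuming the --monitoring property is available in your 5.2.x build; the ports below are only examples) is to enable the monitoring interfaces in init.properties:

--monitoring = jmx:9050,http:9080

and then read the per-component queue sizes and average processing times over JMX (for example with jconsole) or the HTTP statistics view.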

Added by Artur Hefczyc TigaseTeam over 4 years ago

Another thing that might be related to your performance problems: a while ago we found a performance issue which may affect some users in some cases, and we fixed it. Please check ticket #2309.

This fix has not yet been backported to the 5.2.x branch, so the version you are using may suffer from the performance problem fixed in that ticket. Maybe you would like to try our latest nightly build of the 7.x branch to see if it resolves your performance issues?

Added by Kulshreshth Dhiman over 4 years ago

Hi,

Here are the details:

  1. Tigase is deployed on an AWS virtual machine with 16 GB of memory and an 8-core CPU.

  2. Memory settings of the Tigase process:

JAVA_OPTIONS="${GC} ${EX} ${ENC} ${DRV} -server -Xms100M -Xmx8G -XX:PermSize=128m -XX:MaxPermSize=512m -XX:MaxDirectMemorySize=256m"

  3. Traffic details:

    1. No idea about logins/logouts per minute.
    2. We are not using rosters.
    3. On average, across all users, there are 27 presence status changes per second.
    4. 3316 messages in 90 minutes.
    5. No, we don't archive the messages.
    6. No, we are not using MUC/PubSub.
    7. No, there isn't any other traffic.
    8. Only SEVERE logs.

  4. We are using a MySQL database on the same server with 180 pool connections.

Over WebSockets it works pretty fast, but in the case of TCP connections we are facing very slow authentication and bind, and delayed messages. We found that once ClientConnectionManager.processPacket() is reached, the message gets processed instantaneously, but reaching this point takes 20-30 seconds. It seems that the server is slow in reading socket data. Should we increase the DEF_MAX_THREADS_PER_CPU parameter in the SocketThread class from 8 to 32?

We have also changed the Linux settings to avoid buffer overflows. We have 24K connections on port 5222 and 12K WebSocket connections.

Please find attached my Linux settings and Tigase stats dump.

Thanks and regards,

Kulshreshth

linux settings.txt (611 Bytes) - Linux settings
tigaseStats.xml (82.1 KB) - Tigase stats
init.properties (2.31 KB) - init.properties

Added by Artur Hefczyc TigaseTeam over 4 years ago

AWS instances typically offer very poor I/O performance unless you use high-I/O instances. The poor I/O affects network performance as well as database/HDD performance.

A review of your server's metrics seems to confirm that this is the case on your installation. Long processing times for all plugins using the database cause a slowdown.

At the time the metrics were taken, all queues were empty, so it looks like the situation on the server had stabilized. Did you experience long login times at this point, or does it happen only after Tigase is restarted and everybody attempts to connect?

The average login rate on this installation is 25 logins per second, but this is really an average since Tigase was started. What is most interesting is the peak login rate. My assumption is that at peak login time, when everybody attempts to connect, the authentication queue grows, hence the long wait for a successful login. The large number of authentication timeouts confirms that.

Is this a system where real users connect, or is it a system on which you run load tests? If this is a load test, then how many machines do you use to simulate the users' traffic? Make sure the load-generating machines are not overloaded.

Added by Kulshreshth Dhiman over 4 years ago

Hi,

These stats were taken after the server had stabilized (it takes 20-30 minutes to stabilize). There are long delays in auth, bind and message exchange even after the server is stable. At any point in time, the Tigase queues remain empty, except that there can be a small buildup (2-10, at most 20) in the sess-man queue. Our average auth time, as shown in our stats, is around 30 ms (sometimes less). On the client side there is an auth timeout, after which the clients start a new connection again.

It seems that the I/O of the server is too slow, and as a result the whole process is slowing down. Please suggest some mechanism/tool to confirm this. By the way, this is a system where real users connect. We are also using the Tigase HTTP service on port 8080 to run some Groovy scripts; would that make any difference? Also, would increasing the number of Tigase socket threads improve the response time?

Thanks and regards

Kulshreshth

Added by Artur Hefczyc TigaseTeam over 4 years ago

These stats were taken after the server had stabilized (it takes 20-30 minutes to stabilize). There are long delays in auth, bind and message exchange even after the server is stable. At any point in time, the Tigase queues remain empty, except that there can be a small buildup (2-10, at most 20) in the sess-man queue. Our average auth time, as shown in our stats, is around 30 ms (sometimes less). On the client side there is an auth timeout, after which the clients start a new connection again.

Indeed. If, after the timeout, the client retries again without waiting, it makes things even worse. My suggestion is to modify the client (if it is a custom client) to wait some time after the timeout and retry later.
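
A minimal sketch of that idea for a custom Java client; connectAndAuthenticate() is a hypothetical placeholder for whatever connect/auth/bind routine the client already has, and the timings are purely illustrative:

import java.util.concurrent.ThreadLocalRandom;

public class ReconnectWithBackoff {

    // Hypothetical placeholder for the client's connect + SASL auth + resource bind.
    static void connectAndAuthenticate() throws Exception { /* ... */ }

    public static void main(String[] args) throws InterruptedException {
        long delayMs = 5_000;            // first wait after a failed attempt
        final long maxDelayMs = 300_000; // never wait longer than 5 minutes
        while (true) {
            try {
                connectAndAuthenticate();
                break;                   // connected - stop retrying
            } catch (Exception timeoutOrIoError) {
                // Wait (with a little jitter) instead of reconnecting immediately,
                // so thousands of clients do not all hammer the server at the same moment.
                Thread.sleep(delayMs + ThreadLocalRandom.current().nextLong(1_000));
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
    }
}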

If the service stabilizes and there are no queues in the Tigase server, how long does it take to authenticate a user? With 30 ms DB access it should not take longer than 1-2 seconds. If there are no queues inside Tigase it should really not take a long time, unless there is something at the low-level network layer which slows it down. You mentioned that it takes several seconds from sending a packet from a client until it shows up in ClientConnectionManager in Tigase. This is something I have never seen before.

It seems that the I/O of the server is too slow, and as a result the whole process is slowing down. Please suggest some mechanism/tool to confirm this. By the way, this is a system where real users connect.

We are also using the Tigase HTTP service on port 8080 to run some Groovy scripts; would that make any difference?

I do not think so.

Also, would increasing the number of Tigase socket threads improve the response time?

You can try, but I really doubt it would help. It could even make things worse. If the system waits for I/O, adding more threads does not change anything. I would try a few other things:

  1. Try our latest nightly or modify the version you use with the fix I mentioned above. It may improve performance.

  2. Try a more recent JDK version, even JDK 8

  3. Make sure you use the JDK from Oracle, not OpenJDK; we experienced some performance problems with OpenJDK in the past

  4. If you can, try to run Tigase on one of the highest-I/O Amazon instances to see if this makes any difference

I ran some tests myself a while ago and even gave a presentation about it:

https://www.dropbox.com/s/u8ri7unb70139xj/Tigase%20On%20Amazon.pdf

Take a look, maybe you will find it useful. From my tests, it looks like dedicated HW is more cost-effective and offers much better performance.

Lastly, we could investigate the problem for you and help you optimize the system.

Added by Kulshreshth Dhiman over 4 years ago

Hi,

Thanks for your suggestions. We have moved Tigase to a network-optimized AWS instance with the default Linux settings and the Tigase settings provided in the previous message. Now the auth and bind processes are pretty fast, but we still see a noticeable delay when sending message packets, whereas receiving is fast. If messages are sent from a desktop client using WebSockets, the communication is fast both ways. We have 10K WebSocket connections (5290) and 50K TCP connections (5222).

Do we need to adjust any settings on the Tigase end?

Thanks

Kulshreshth Dhiman

Added by Artur Hefczyc TigaseTeam over 4 years ago

My suggestion is to update to the just-released 5.2.3 version, which may improve performance. Apart from this, my other suggestion is to look at the Tigase metrics for queues and average processing times for potential bottlenecks.

Added by Kulshreshth Dhiman over 4 years ago

Hi,

We increased the ServerSocketChannel backlog in the ConnectionOpenThread class to 2048 when binding, which solved our problem.

ssc.socket().bind(isa, 2048);
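
For context, a minimal standalone sketch of what that backlog argument does (the port is just an example; in Tigase this call lives in ConnectionOpenThread). When no backlog is passed, Java's ServerSocket falls back to a small default of 50, which is easily exhausted when tens of thousands of clients reconnect at once:

import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class BacklogBindExample {

    public static void main(String[] args) throws Exception {
        ServerSocketChannel ssc = ServerSocketChannel.open();
        ssc.configureBlocking(false);
        // The second bind() argument is the accept backlog: the number of pending,
        // not-yet-accepted connections the OS will queue for this listening socket.
        // On Linux the effective value is additionally capped by net.core.somaxconn.
        ssc.socket().bind(new InetSocketAddress(5222), 2048);
        System.out.println("Listening on " + ssc.socket().getLocalSocketAddress());
    }
}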

Thanks

Kulshreshth

Added by Artur Hefczyc TigaseTeam over 4 years ago

Thank you for the hint. This certainly makes sense. I have also updated our code; however, I changed your code a bit to:

ssc.socket().bind(isa, (int) (port_throttling));

So the backlog is set to the value of the connection-throttling property for the given port.
