How AWS helped us optimize memory usage in Tigase HTTP API
Recently we moved our xmpp.cloud (formerly branded as sure.im) installation from a hosting provider where we used dedicated servers to Amazon AWS cloud-based hosting. Benefits of this move are listed in this article. During the migration we chose the smallest AWS instances that would be good for hosting the services: t2.medium. The installation went through without any issues, and test runs showed that the systems were operating properly. Should we need to scale our installation, another cluster node could always be started.
However, after some time we started to experience crashes on the new installation. The JVMs running Tigase XMPP Server were being terminated by the Linux kernel due to memory allocation issues. In a typical situation, we would receive some OutOfMemoryErrors from the JVM notifying us that something was wrong and that the JVM memory settings needed adjustment. This time, however, that was not the case. Instead, the JVM was being shut down with a single entry written to the tigase-console.log file, along with an hs_err_pid file being created. The following entries were written to these files:
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
Possible reasons:
  The system is out of physical RAM or swap space
  In 32 bit mode, the process size limit was hit
Possible solutions:
  Reduce memory load on the system
  Increase physical memory or swap space
  Check if swap backing store is full
  Use 64 bit Java on a 64 bit OS
  Decrease Java heap size (-Xmx/-Xms)
  Decrease number of Java threads
  Decrease Java thread stack sizes (-Xss)
  Set larger code cache with -XX:ReservedCodeCacheSize=
This output file may be truncated or incomplete.

Out of Memory Error (os_linux.cpp:2640), pid=3633, tid=0x00007fe49c4c4700

JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode linux-amd64 compressed oops)
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
The failed native memory allocation suggested that the problem was not the Java HEAP size but an insufficient amount of free memory on our AWS instances. After verifying the JVM memory settings, we found that the configuration should have left plenty of free memory on the instance, so this issue should not have occurred. However, it was happening about once a day and needed to be fixed. Since we were using Java 8 at this point, the JVM memory was divided into several areas, among them HEAP, MetaSpace, and DirectMemory. We only had limits set for HEAP, so the issue must have been with DirectMemory growing without any limit other than the amount of free RAM on our AWS instance.
Additionally, the Tigase XMPP Servers at the xmpp.cloud installation were processing a lot of SPAM messages when the crashes were happening. Because of that, we suspected a leak in server-to-server (S2S) connection buffers, as a lot of those connections were being created and many of them were saturated by the high volume of incoming messages, most of which were SPAM. Knowing that, we decided to limit the amount of memory allowed for DirectMemory to 128MB.
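To make the memory areas discussed above easier to picture, here is a minimal Java sketch that lists them through the standard `java.lang.management` API. The class name `JvmMemoryAreas` is made up for illustration and is not part of Tigase. The flags in the comments are the standard HotSpot options we assume would be used to set such limits.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of Tigase: reports each JVM memory area with its limit.
// Flags assumed at JVM startup (the 128m value is the one chosen in the text):
//   -Xmx<size>                      upper bound for HEAP
//   -XX:MaxDirectMemorySize=128m    upper bound for direct buffers
class JvmMemoryAreas {

    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        // HEAP and NON_HEAP pools (on HotSpot, Metaspace and the code cache are NON_HEAP).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when no maximum is defined for the pool.
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / 1024) + " KB";
            lines.add(pool.getType().name() + " " + pool.getName()
                    + " used=" + (usage.getUsed() / 1024) + " KB max=" + max);
        }
        // Direct and mapped buffers are tracked separately from the pools above.
        for (BufferPoolMXBean buffers : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            lines.add("BUFFER " + buffers.getName()
                    + " used=" + (buffers.getMemoryUsed() / 1024) + " KB");
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

Running it inside the affected JVM (or exposing the same data over JMX) shows which areas have a defined maximum and which can grow until the operating system refuses further allocations.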
Isolating the cause of the issue¶
The following day a second crash occurred; crashes usually happened between 24 and 32 hours after server startup. This time, before the crash, we received quite a few OutOfMemoryError errors related to MetaSpace. The errors did not point to native memory being depleted. All of the OutOfMemoryError errors were thrown from the code responsible for handling HTTP requests, but they did not point to any particular location within that code. Since we had no tests in place to measure the MetaSpace requirements of the Tigase XMPP Server installed at xmpp.cloud, we proceeded with a simple fix: MetaSpace was limited to 256MB, and Tigase HTTP API was reconfigured to use Jetty API Server. Using Jetty instead of the Java Embedded Server reduces the amount of DirectMemory required for handling HTTP requests.
Unfortunately, after 27 hours the Tigase XMPP Servers at the xmpp.cloud installation were down again, and analysis still pointed to the Tigase HTTP API. This component was still the sole source of OutOfMemoryError errors related to MetaSpace, which is allocated out of native memory (non-HEAP).
Analysis of the memory usage¶
Having isolated the issue, we decided to replicate it in a controlled environment, measure memory usage, and take a few memory dumps to compare memory usage at different points in time. Knowing that it was related directly to the HTTP API, we focused on testing that component. We began by testing REST API requests, which we had started to use internally for one of the new features yet to be introduced for users of xmpp.cloud. During those tests HEAP memory was almost empty, and MetaSpace usage increased slightly during the first part of the test. This behavior is expected, as Groovy scripts are compiled and loaded into memory. Later on, MetaSpace usage fluctuated but remained more or less stable. Only CodeCache space was changing, due to the JVM recompiling code to optimize its execution time. As direct calls to the REST API were working fine, we had to focus on accessing the HTTP-API component using a web browser. This meant we needed to test the REST API, Admin UI, and other modules which are accessible from a web browser. As soon as we ran the first tests using the browser, MetaSpace usage increased with each request and kept growing until the set limit was reached. Then OutOfMemoryError errors began to be thrown. Thanks to the memory dumps taken during those tests, we were able to identify which classes were using memory and which were allocated during each request. There we found a lot of classes containing getTemplate within their names. Each class was named GStringTemplateScript followed by a number, indicating multiple instances. As we are using GStringTemplateEngine from Groovy to create HTML output for web browsers, we surmised that this template engine was leaking memory by generating new classes for each request and not unloading the older classes when they were no longer needed. Our memory leak had been found.
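The kind of measurement described above can be approximated without full memory dumps. The following is a rough Java sketch, not the tooling we actually used: it samples the class-loading counters and Metaspace usage exposed by the standard `java.lang.management` API. The class `ClassGrowthProbe` and its methods are hypothetical names, and the pool name `"Metaspace"` is an assumption that holds for HotSpot on Java 8 and later.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical probe: compares class-loading counters before and after a batch of requests.
class ClassGrowthProbe {

    /** Snapshot of class-loading counters and Metaspace usage at one point in time. */
    static final class Sample {
        final int loadedNow;            // classes currently loaded
        final long loadedTotal;         // classes loaded since JVM start
        final long unloadedTotal;       // classes unloaded since JVM start
        final long metaspaceUsedBytes;  // -1 if no pool named "Metaspace" was found

        Sample(int loadedNow, long loadedTotal, long unloadedTotal, long metaspaceUsedBytes) {
            this.loadedNow = loadedNow;
            this.loadedTotal = loadedTotal;
            this.unloadedTotal = unloadedTotal;
            this.metaspaceUsedBytes = metaspaceUsedBytes;
        }
    }

    static Sample take() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        long metaspace = -1;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if ("Metaspace".equals(pool.getName()) && usage != null) {
                metaspace = usage.getUsed();
            }
        }
        return new Sample(classes.getLoadedClassCount(), classes.getTotalLoadedClassCount(),
                classes.getUnloadedClassCount(), metaspace);
    }

    /** Net change in the number of currently loaded classes between two samples. */
    static long netGrowth(Sample before, Sample after) {
        return (long) after.loadedNow - before.loadedNow;
    }

    public static void main(String[] args) {
        Sample before = take();
        // ... send a batch of HTTP requests to the component under test here ...
        Sample after = take();
        System.out.println("net class growth: " + netGrowth(before, after));
    }
}
```

If the net growth keeps rising by roughly the same amount for every batch of identical requests, and the unloaded counter stays flat, classes are being created per request and never released, which matches the pattern we saw.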
Fixing the Issue¶
To fix the issue we started with code analysis to find the usage patterns of GStringTemplateEngine which led to the leak. In our case, the leak was caused by an automatic reload mechanism for GString templates. These are stored in files under the scripts/rest/ directory of the Tigase XMPP Server installation directory and are loaded when needed. To make development and customization of those templates easier, the requested templates were reloaded on each HTTP request. To keep this fast, we kept a single GStringTemplateEngine instance (per servlet) which handled every request. Previously, this mechanism caused only a slow increase in MetaSpace usage, and we had never before experienced such a rapid growth in memory use. But this instance of GStringTemplateEngine had its own ClassLoader and internally kept a reference to each class created while parsing a GString template. This led to increased MetaSpace usage, then to OutOfMemoryError errors, and eventually to crashes. Having pinpointed the real cause of the issue, we reviewed the usage of GStringTemplateEngine in our code and changed it to make sure that:
- We load all templates at once using a single GStringTemplateEngine and cache the generated templates. No more automatic reloading of templates.
- When a manual reload of templates is initiated, we release the old instance of GStringTemplateEngine and parse the templates using a new one.
This way we can still use GStringTemplateEngine and our GString templates while maintaining stable MetaSpace usage. As our template instances are cached, responses to HTTP requests from web browsers will now be faster. Manual reloading will still generate new classes. However, as we are releasing the instance of GStringTemplateEngine and its internal ClassLoader, we are also releasing the classes loaded by it from MetaSpace and making sure that this memory can be cleaned up by the garbage collector. After extensive testing, we were able to confirm that memory usage is now stable.
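The caching scheme described above can be sketched roughly as follows. This is not the Tigase implementation: `TemplateCache` and its type parameters are hypothetical, with `E` standing in for a template engine such as GStringTemplateEngine and `T` for a compiled template, so that the example stays plain Java with no Groovy dependency.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Supplier;

// Hypothetical sketch: compile every template once with one engine, serve requests from
// the cache, and use a fresh engine only when a manual reload is requested.
final class TemplateCache<E, T> {
    private final Supplier<E> engineFactory;          // creates a new engine instance
    private final BiFunction<E, String, T> compiler;  // compiles one named template with an engine
    private final List<String> templateNames;
    // Published as one unmodifiable map so that readers never see a half-loaded set.
    private volatile Map<String, T> compiled = Collections.emptyMap();

    TemplateCache(Supplier<E> engineFactory, BiFunction<E, String, T> compiler,
                  List<String> templateNames) {
        this.engineFactory = engineFactory;
        this.compiler = compiler;
        this.templateNames = templateNames;
        reload(); // initial load: all templates at once
    }

    /** Manual reload: a fresh engine compiles everything, then the old set is dropped. */
    void reload() {
        E engine = engineFactory.get();
        Map<String, T> fresh = new HashMap<>();
        for (String name : templateNames) {
            fresh.put(name, compiler.apply(engine, name));
        }
        compiled = Collections.unmodifiableMap(fresh);
        // The previous templates (and, through them, the previous engine) are no longer
        // referenced from this cache, so they become collectable once in-flight
        // requests finish with them.
    }

    /** Request path: a cache lookup only, with no compilation and no new classes. */
    T get(String name) {
        T template = compiled.get(name);
        if (template == null) {
            throw new IllegalArgumentException("Unknown template: " + name);
        }
        return template;
    }
}
```

With a real engine, the factory would create the engine and the compiler would parse a template file with it. The important property is that the number of compilations depends on the number of reloads, not on the number of requests.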
What about Amazon Web Services?¶
As previously mentioned, we recently moved to AWS from our old hosting provider and enabled a new feature for our users. This feature is based on Tigase HTTP API and REST API, and uses both APIs extensively. We needed to expose this HTTP-based API to our other services, and to do that we decided to use AWS's Elastic Load Balancer to transparently forward those HTTP requests to each of the Tigase cluster nodes. This way it would automatically switch to a different node if one of them was overloaded or offline. Amazon's Elastic Load Balancer executes health check requests every few seconds to detect whether the destination host is up and running fine. In our case it was testing the REST API, which generated responses formatted in HTML just as it would for a normal web browser. As a result, GStringTemplateEngine was used to handle each health check, and each request created new classes in memory, forcing the JVM to use more and more memory until it ran out.
Thanks to AWS for helping us optimize memory usage in Tigase HTTP API