The Tales from Observability series delves into our ongoing pursuit of observability at Adyen. In these articles, we'll dive into the challenges we encounter while scaling our data pipelines to accommodate an ever-growing load of telemetry data. We'll also explore the tools we adopted to empower engineers at Adyen to take ownership of their service operations and achieve operational excellence.
Elasticsearch Performance Killers
Two years ago we started investigating a series of catastrophic failures of our Elasticsearch cluster which underpins our observability solutions. The cause of these failures was elusive and frustrating. Yet, the mitigation was deceptively simple. In this blog post we will retrace the steps taken by our Platform Observability team to find the needle in the haystack that caused the problems and learn a few things about Elasticsearch along the way.
The impact of cluster failure
Firstly, let’s emphasise the impact of a failure of Elasticsearch at Adyen. In the Platform Observability team, we make sure that the company has oversight over everything that happens on the platform – from business-as-usual monitoring, to support and incident response. To achieve this, we use Elasticsearch to make data easily available and queryable by our users.
It's important to note that our users are not necessarily technical, and they rely on the platforms and tooling we provide to troubleshoot issues and quickly respond to incidents. With this in mind, let us share an incident that occurred back in 2021, when we really started to notice issues. We had recently upgraded the hardware that the cluster was running on, and we were looking forward to the additional performance and stability that this change would bring.
This optimism did not last long: whilst we did indeed see an improvement in overall performance, there were still occasional outages. We received reports from users that dashboards were loading slower and slower until they eventually just failed. We checked the logs and confirmed that the cluster was not having a good time, but we could not immediately determine the cause. CPU and memory usage were spiking, and garbage collection activity was high. Nodes would appear to drop out of the cluster, and ingestion was failing too.
All in all, the cluster was having a terrible day and we had no idea why. Without more knowledge of what was happening, we were left with a simple, yet painful and time-consuming mitigation: restart all nodes! In doing so, we learned an interesting fact: during the worst outages, nodes don't respond to SIGINT and have to be killed violently with SIGKILL in order to restart.
Investigations & Hypotheses
We had to get to the bottom of this issue, because restarting nodes manually led to long outages and it was taking up valuable time for our infrastructure team. Given that we had no concrete leads on the cause, the only way forward was to try to collect as much information as possible. We had a few hypotheses:
Some user queries were killing the cluster: To follow up on this, we had the option to enable slow logs to record when a query takes more than a predefined amount of time. Technically, this was easy to enable, but these logs would be stored in a separate Elasticsearch instance and our infrastructure colleagues were already spending a lot of their time maintaining Elasticsearch clusters. They were understandably hesitant when we told them that we wanted to add a new source of data to a cluster which had been one of the "stable ones". So, we had to guarantee that the new volume from the logs would not be too high and have a rollback plan.
Something was wrong with the cluster layout or the hardware: We thought maybe we had made some mistake with assigning master nodes, or that there was a problematic connection between some of the nodes in the cluster. For this, we would just have to work with our colleagues in infrastructure to get access to the repos where we could see all the juicy configuration details for both Elasticsearch and the machines it was running on.
A bug in Elasticsearch?!: The Elasticsearch codebase is open-source, and although we had little hope of understanding the whole query path of the code in a short amount of time, we believed that if we skimmed the relevant parts of the codebase and read through the discussions on Pull Requests, we would discover something useful.
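For the first hypothesis, slow logs are a per-index settings change in Elasticsearch. A minimal sketch of what enabling them might look like; the thresholds here are illustrative, not the values we used:

```python
# Hypothetical slow-log thresholds: any query slower than these gets
# logged. Applied per index, e.g. PUT /<index>/_settings with this body.
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}
```

Because each matching query adds a log line, the extra volume is bounded by how many queries actually cross the thresholds, which is what let us keep our promise about volume to the infrastructure team.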
Although we learned a lot from the other two sources of information (points 2 and 3 above), our real breakthrough came when we enabled the slow logs. The realisation is obvious in retrospect: CPU spikes and outages corresponded with complicated queries. "Complicated" is doing a lot of heavy lifting here; the complication was that some queries specified a lot of fields to group by. This gave us a lead to continue investigating, so we looked into how aggregations are computed in Elasticsearch.
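To make "a lot of fields to group by" concrete, here is a sketch of the kind of query body we mean (field names are hypothetical). Each extra level of nesting multiplies the bucket count:

```python
# Grouping by several fields via nested terms aggregations. With, say,
# 100 distinct values per field, three levels can already produce up to
# 100 * 100 * 100 = 1,000,000 buckets in a single response.
query = {
    "size": 0,  # we only want the aggregation, not the hits
    "aggs": {
        "by_merchant": {
            "terms": {"field": "merchant"},
            "aggs": {
                "by_country": {
                    "terms": {"field": "country"},
                    "aggs": {
                        "by_status": {"terms": {"field": "status"}}
                    },
                },
            },
        },
    },
}
```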
Aggregations in Elasticsearch
In Elasticsearch clusters, the node that first receives a request becomes the coordinating node and is responsible for sending out requests to all the relevant data nodes and recombining the results from the data nodes.
Since the response is constructed on the coordinating node, and is fully built in memory before being sent back to the user, it became clear why we had an issue: the size of the response scales with the total cardinality of all the fields being grouped by. There is clearly a limit to the total cardinality that can be handled.
Luckily, there is an easy fix for this: the cluster setting search.max_buckets limits the total number of aggregation buckets a single query can construct. In our version of Elasticsearch, the default for this setting was unlimited, but in later versions it is capped at a few thousand. Thinking we had found the root cause of our issues, we applied this change. To our surprise, relatively few dashboards broke, despite the change not being backwards compatible.
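Applying the limit is a one-line cluster settings change; a sketch, where the numeric limit is illustrative rather than the value we chose:

```python
# Cap the number of buckets any single aggregation may build.
# Applied via: PUT /_cluster/settings with this body. Queries that
# would exceed the limit now fail fast instead of exhausting memory
# on the coordinating node.
cluster_settings = {
    "persistent": {
        "search.max_buckets": 10000
    }
}
```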
Our deduction here is that users who wish to visualise some data typically don't want to be overwhelmed by many thousands of data points. So, most broken dashboards and clients could be fixed with small tweaks. In the few cases where large aggregations were truly required, we could use composite aggregations to split one large query into many smaller queries.
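A composite aggregation pages through buckets instead of building them all at once: each response carries an "after_key" that the client feeds back to fetch the next page. A sketch, with a hypothetical field name and page size:

```python
# Page through a large aggregation 1,000 buckets at a time, so no
# single request has to hold every bucket in memory.
def merchants_page(after_key=None):
    composite = {
        "size": 1000,  # buckets per page
        "sources": [
            {"merchant": {"terms": {"field": "merchant"}}}
        ],
    }
    if after_key is not None:
        # Resume from where the previous page left off.
        composite["after"] = after_key
    return {"size": 0, "aggs": {"all_merchants": {"composite": composite}}}

first_page = merchants_page()
# The "after_key" from each response drives the next request:
next_page = merchants_page(after_key={"merchant": "m-123"})
```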
Investigation: timeouts and long-running queries
Having done all this work and these investigations, we were looking forward to collecting some statistics on how much better the cluster was doing. There were still outages 🤦. There was some improvement, so our efforts weren't totally wasted, but we also clearly hadn't solved the issue.
We had to reconsider the facts: we could see that outages still correlated with some queries, but those queries didn't look complicated anymore. However, slow logs were showing that some queries were running for minutes despite the fact that we had set a 30 second timeout. For some reason, the timeout was not working and even attempting to cancel the queries explicitly didn’t have an effect. And again, nodes that were unresponsive did not respond to SIGINTs.
We analysed the slow queries thoroughly and could see that all the problematic queries were still using aggregations, although typically only a single one grouping by a single field. This confused us, because search.max_buckets should have been protecting us from large aggregations. It was at this point that our earlier investigation into the Elasticsearch repo became useful. We discovered more about how Elasticsearch attempts to optimise aggregation queries, and one feature in particular stood out: global ordinals.
In order to store data more efficiently, the Elasticsearch fielddata cache does not directly store all the text in a field. It stores an ordinal which refers to the text value. These ordinals are specific to their segment. So, even for the same text value, the ordinal can be different per segment. This poses a problem for running aggregations since it is not enough to just look at the ordinal values, and fetching the text value for all the documents would be expensive.
The solution to this problem is a technique called global ordinals: a mapping from the local segment ordinals to a globally unique ordinal value. This way, an aggregation query can use the global ordinals data structure to compute the counts for each bucket, and the true text values only need to be fetched at the very end. The important points that created a problem for us were:
Global ordinals need to be constructed every time the cluster does a refresh; which for us is every 30 seconds.
Global ordinals are computed lazily at query time when they are needed. This means that the cost of building global ordinals manifests as a slow aggregation query.
Worst of all, a query only checks whether it should time out when it reads data and crosses over from one segment to another. So a query that is building global ordinals does not time out and cannot be cancelled.
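A toy model of the idea (this is an illustration, not Elasticsearch internals): each segment numbers its terms independently, and building global ordinals means merging every segment's term dictionary into one global numbering, a cost that scales with the field's total cardinality.

```python
# Each segment stores text values as local ordinals; the same term can
# have a different ordinal in different segments.
segments = [
    ["amsterdam", "berlin"],   # segment 0: ord 0 -> amsterdam, 1 -> berlin
    ["berlin", "chicago"],     # segment 1: ord 0 -> berlin,    1 -> chicago
]

# Build one global term dictionary: work scales with total cardinality,
# which is exactly why high-cardinality id fields hurt.
global_terms = sorted({t for seg in segments for t in seg})
global_ord = {t: i for i, t in enumerate(global_terms)}

# Per-segment mapping from local ordinal -> global ordinal. Aggregations
# can then count buckets by global ordinal and only resolve the text
# values at the very end.
local_to_global = [[global_ord[t] for t in seg] for seg in segments]
```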
Putting all of this together, can you now see what was happening? Consider that the size of the global ordinals data structure scales with the cardinality of the field. The problematic queries were aggregating over an id field, which naturally had high cardinality. The first query to aggregate over this field would cause global ordinals to be constructed. Building global ordinals takes "a long time", but it will not time out because the points at which timeouts are checked are never reached. Hence, the nodes computing global ordinals become unresponsive. Building the global ordinals also creates a surge in garbage collection pauses. Master nodes consider these nodes to have left the cluster, and shard rebalancing starts. Now, all nodes are suffering from the load of rebalancing the shards. Ergo, unhappy users.

If you are familiar with the Elasticsearch query language, you may know that it is possible to specify that global ordinals should not be constructed for a query. This causes the query to fetch the text value during the aggregation. It would make the query check timeouts, and as the buckets are being constructed, search.max_buckets would prevent large aggregations. However, teaching our users about this option, and getting them to set it in all cases where it was applicable, sounded difficult.
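The option in question is the terms aggregation's execution hint, which tells Elasticsearch to map field values directly per document instead of building global ordinals. A sketch, with a hypothetical field name:

```python
# A terms aggregation that opts out of global ordinals. Because values
# are fetched while buckets are built, timeout checks and the
# search.max_buckets limit both apply again.
query = {
    "size": 0,
    "aggs": {
        "by_id": {
            "terms": {
                "field": "psp_reference",
                "execution_hint": "map",  # skip global ordinals
            }
        }
    },
}
```

The catch, as noted above, is that every client issuing such a query has to remember to set it.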
We ended up deciding to disable aggregations on fields that we knew were ids. After making this change, and migrating users that depended on aggregations on ids, we immediately saw the effect we were hoping for: query times improved, and the catastrophic failures where the cluster was incapable of serving requests vanished. We counted that as a massive success.
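One way to make a field non-aggregatable (a sketch under assumptions; the post doesn't state exactly how we disabled aggregations, and the field name is hypothetical) is to disable doc_values in the mapping, so an aggregation on the field fails outright instead of building global ordinals:

```python
# Mapping for an id field that should never be aggregated on.
# doc_values: False keeps the field searchable by exact match but
# removes the columnar storage that aggregations and sorting need.
mapping = {
    "properties": {
        "psp_reference": {
            "type": "keyword",
            "doc_values": False,
        }
    }
}
```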
Our key learnings
After this whole ordeal and our investigations, there were a few key learnings.
When in doubt, just collect more information: This can be concrete information like logs, or it can be something as simple as reading through GitHub issues or talking to other people who might have ideas.
Make sure you have spare ingestion capacity: The speed at which we can process the backlog after an outage acts as a multiplier on our total downtime. If we have an hour-long outage and we can recover at 2 seconds of delay per second, then it takes us another hour to catch up with the latest data, meaning we now have a 2-hour outage. If we were running the cluster close to max capacity, there would be a massive multiplier on our mean time to recover. So, keeping some spare capacity is important for keeping our promised uptime. Precisely, for a recovery speed of r (seconds of backlog processed per second of wall-clock time), the total downtime will be multiplied by a factor of r / (r - 1).
For example, with a recovery speed of r=2, the total downtime is multiplied by 2. If the recovery speed was r=1.2 then the total downtime is multiplied by 6!
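The arithmetic behind those two examples, as a one-liner: an outage of length D is followed by a catch-up phase of D / (r - 1), so the total is D * r / (r - 1).

```python
def downtime_multiplier(r: float) -> float:
    """Factor by which total downtime exceeds the original outage,
    given recovery speed r (seconds of backlog cleared per wall-clock
    second, r > 1): the outage D plus a catch-up phase of D / (r - 1)."""
    return r / (r - 1)

# downtime_multiplier(2)   -> 2.0  (an hour of outage costs two hours)
# downtime_multiplier(1.2) -> ~6   (barely any headroom: 6x the downtime)
```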
It usually gets worse before it gets better: What we noticed was that even if some problematic queries were finishing; the fact that nodes had been unresponsive meant that the master node had considered those nodes dead and the shards that used to be on them had to be re-distributed. This is an expensive process, and would typically lead to the cluster being unavailable for longer. There might be clever tricks to avoid this, but shard rebalancing is unavoidable if you want to be resilient. There's not much that can be done about this, but it's good to be aware of.
Keep Elasticsearch up to date as much as possible: Especially with respect to configuration values like search.max_buckets, if we had been on a more recent version, we would have avoided one of these problems.
Additionally, it makes forum resources easier to use, since the recommendation is often simply to upgrade to a more recent version. At the end of the day, we were able to make a substantial improvement to the stability of the cluster, and we hope you learned something from our struggles. Watch Eavan's talk on this topic with the Elasticsearch meetup group here:
Also, we are hiring for a role in our Platform Observability team: check out this page for the job description, or visit our tech career page for other roles.