More Lessons from Working with Elasticsearch at Scale

Elasticsearch is a powerful tool, but it becomes increasingly fragile, finicky, and needful of constant attention as the size of the data and the frequency of updates to that data grow. It is next to impossible to predict outcomes for a sizable use case in advance; experiment is really the only practical way to find out how things will behave. It takes a very long time - hours or even days - to recover from dramatic failures, and thus both careful operations planning and the existence of backup clusters are a necessity. Now that I'm back to wrangling sizable Elasticsearch clusters on a more regular basis, I thought I'd add a few more items to last year's list of things that I wish I'd come to understand earlier in my association with Elasticsearch.

Force Merging is not Optional

Elasticsearch and frequent updates to data don't play well together; the character and overall cost of a cluster, including the effort needed to maintain it, is largely determined by the pace of updates. Updates must be processed through the cluster rules, which tends to be expensive in comparison to the average query. Further, and more importantly, updates don't change existing data in place: each one writes a new version of the document and leaves the old version behind, marked as deleted, until the segments holding it are merged. Over time this accumulation will degrade cluster performance until the cluster eventually fails catastrophically at query loads it could handle perfectly well when new.

The way to prevent this decline from occurring is to make use of the force merge feature, called the optimize API in earlier versions. This rewrites the index segments and discards the superseded versions of updated documents.

curl -XPOST 'http://localhost:9200/_forcemerge?max_num_segments=1'

Unfortunately this is costly, and short of shutting down the cluster, a force merge cannot be stopped once it has started. It isn't unreasonable to expect force merging to run for a day in larger clusters, and it will significantly add to the load on cluster nodes during that time - quite likely to the point of disaster for a heavily trafficked cluster. You will want to make use of the backup clusters you have in place (you do have backup clusters, right?) and plan to swap traffic from primary to backup on a regular basis in order to run the force merge process without affecting response times. How often to run? That really depends on the frequency of updates, and thus the pace at which performance declines, but I've seen both weekly and monthly schedules running in the wild.
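One rough way to judge how quickly updated data is piling up between force merges is to compare live and deleted document counts per index. The sketch below uses the cat indices API and its docs.count and docs.deleted columns; the ES_URL variable, the helper names, and the 20 percent threshold are my own illustrative assumptions, not a recommendation.

```shell
# Assumed location of the cluster; override ES_URL as needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Fetches "index docs.count docs.deleted" rows from the cat indices API.
fetch_doc_counts() {
  curl -s "$ES_URL/_cat/indices?h=index,docs.count,docs.deleted"
}

# Reads those rows on stdin and prints each index whose deleted documents
# make up more than the given percentage of all documents in its segments.
# (hypothetical helper; the threshold is passed as the first argument)
deleted_report() {
  awk -v limit="$1" '
    $2 + $3 > 0 {
      pct = 100 * $3 / ($2 + $3)
      if (pct > limit) printf "%s %d%%\n", $1, pct
    }'
}

# Usage against a live cluster:
#   fetch_doc_counts | deleted_report 20
```

Tracking this number over a few weeks gives some basis for choosing between a weekly and a monthly schedule.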

Monitor Everything, and Visualize that Data in Dashboards

Every cluster should have its own comprehensive metrics dashboard to present a holistic view, broken down by server: JVM properties such as heap usage, server load and CPU utilization, disk writes, traffic in and traffic out, query rates, query response times, shard allocations, force merges, and so forth. Datadog or a similar service is a sensible choice for this sort of visualization. Without a very detailed view of this nature, diagnosing issues in any reasonable amount of time is next to impossible. This is the best way to find bad nodes, poor index placement, and many other problems that cannot be easily revealed using the built-in cluster health tools.
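This is no substitute for a dashboard, but as a small illustration of the per-server breakdown, the sketch below pulls a few of these metrics from the cat nodes API and flags nodes with high heap usage. The ES_URL variable, the helper names, and the threshold are assumptions of mine, and the load_1m column assumes a 5.x or later cluster.

```shell
# Assumed location of the cluster; override ES_URL as needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Fetches "name heap.percent cpu load_1m" rows, one per node.
fetch_node_stats() {
  curl -s "$ES_URL/_cat/nodes?h=name,heap.percent,cpu,load_1m"
}

# Reads those rows on stdin and prints nodes whose heap usage is at or
# above the percentage given as the first argument.
# (hypothetical helper; assumes node names contain no spaces)
flag_heap() {
  awk -v limit="$1" '$2 + 0 >= limit { print $1, "heap=" $2 "%" }'
}

# Usage against a live cluster:
#   fetch_node_stats | flag_heap 85
```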

There is No Easy Escape from the Downward Spiral of Cluster Failure

Once an Elasticsearch cluster starts to show load issues and raised response times, there is next to nothing that can be done to rescue it while still continuing to service the current level of incoming traffic. If the decline is due to too many updates occurring since the last force merge, then triggering a force merge will almost certainly push things over the edge due to the additional load it imposes.

But wait, isn't this Elasticsearch; can we not just add extra nodes to increase capacity? Well sure, but adding extra nodes without first throttling down the pace of shard allocation will just mean a jump in load as nodes move shards onto the new, empty nodes. Again, this may well just tip things over the edge. If you do throttle down shard allocation, or manage it manually to the point at which the additional load will not make things worse, then it will take a significant amount of time for the new nodes to start helping - possibly hours for large clusters.
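For reference, throttling shard allocation is usually a matter of lowering a few cluster settings before the new nodes join. The sketch below builds and sends a transient settings update; the setting names are standard Elasticsearch cluster settings, while the ES_URL variable, the helper names, and the example values are illustrative assumptions that would need tuning for any real cluster.

```shell
# Assumed location of the cluster; override ES_URL as needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Builds the JSON body for a transient cluster settings update. Arguments:
#   $1 concurrent shard rebalances allowed across the cluster
#   $2 concurrent recoveries allowed per node
#   $3 recovery bandwidth limit per node, for example 20mb
throttle_body() {
  printf '{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance":%s,"cluster.routing.allocation.node_concurrent_recoveries":%s,"indices.recovery.max_bytes_per_sec":"%s"}}\n' "$1" "$2" "$3"
}

# Sends the settings to the cluster (hypothetical helper). The Content-Type
# header is required by 6.x and later and is harmless on older versions.
apply_throttle() {
  curl -s -XPUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "$(throttle_body "$@")"
}

# Usage before adding nodes:
#   apply_throttle 1 1 20mb
```

The lower these values, the gentler the added load, and the longer it takes before the new nodes carry their share.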

The point here is that there is no easy way out while still continuing to serve traffic on the cluster exhibiting issues. Nearly everything that can be done will bring either significant additional load in the short term, or take a long time, or both. The right course of action is to fail over immediately to your backup cluster (you do have backup clusters, right?), and then force merge, add nodes, or perform whatever other curative action is needed on the primary cluster in the absence of traffic.

Logs can be Dangerous, So Pay Attention to their Configuration

Elasticsearch clusters can be set to fairly freely reallocate shards based on available space, aiming to distribute load as best as possible. Elasticsearch nodes can also log voluminously when under heavy load - the slow log in particular can become very large very quickly. Careful attention has to be given to log management strategies such as rotation and compression, otherwise it is possible for a heavily trafficked cluster to fill up enough of its space with logs to cause problems. In the most pathological cases, a cluster will start to cycle shard allocations back and forth constantly between nodes, generating network transfer costs and degrading performance to the point of failure.
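A quick way to see how close each node is to the point where disk usage starts to influence shard allocation is the cat allocation API. The sketch below flags nodes at or above a chosen disk usage percentage; ES_URL, the helper names, and the threshold are my own assumptions, with 85 chosen only because it matches the default low disk watermark.

```shell
# Assumed location of the cluster; override ES_URL as needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Fetches "node disk.percent" rows, one per data node.
fetch_disk_usage() {
  curl -s "$ES_URL/_cat/allocation?h=node,disk.percent"
}

# Reads those rows on stdin and prints nodes whose disk usage is at or
# above the percentage given as the first argument. Rows with no disk
# figure, such as the UNASSIGNED row, are skipped.
# (hypothetical helper; assumes node names contain no spaces)
flag_disk() {
  awk -v limit="$1" '$2 + 0 >= limit { print $1, "disk=" $2 "%" }'
}

# Usage against a live cluster:
#   fetch_disk_usage | flag_disk 85
```

Note that this figure covers everything on the volume, logs included, which is exactly why runaway logs can end up driving shard movement.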

A Single Bad Node Can in Fact Bring Down an Entire Cluster

One of the selling points of Elasticsearch is that it is resilient to failure; given enough capacity, the failure of a single node will cause no great harm. It will be replaced, shards allocated to the replacement, and life will go on. Sad to say, there are definitely failure modes in which a single bad node can prevent the whole cluster from recovering. I've seen a bad network filesystem mount cause this in an earlier Elasticsearch version; the node was reporting itself healthy, but the shard data on the network mount was in the sort of nebulous state that a bad filesystem will produce - access to the filesystem intermittently hanging, file listing commands never returning, that sort of thing. The entire cluster was effectively left waiting on the one node to respond to a low-level interaction of some sort.