Causes of Elasticsearch Performance Degradation Over Time
Left to its own devices, an Elasticsearch cluster ages and falters over time. Its median performance declines, the long tail of slow response times grows, and it runs an ever greater risk of response time spikes caused by garbage collection or other issues. This is almost entirely caused by updates and additions of data: the character of a cluster, and the level of attention and maintenance it requires, is largely determined by the pace at which its data changes.
Shard Growth Due to Updates to Existing Data
Elasticsearch doesn't update documents in place; it writes the update as a new document and marks the old version as deleted. This causes growth in the size of indexes and their shards, which in turn slows processing and increases memory requirements. Taken to its extreme, the growth in shard size eventually becomes too much for the cluster's servers to handle gracefully at the expected level of traffic and available memory, and garbage collection events become large and frequent. This happens unevenly across indexes, and thus across types of request, depending on which data is updated more or less heavily.
The solution to this issue is a schedule of periodic force merge operations, which clear out the deleted documents. Depending on the size of the cluster and the nature of its use, this may require failing over to a backup cluster for the duration of the force merge - which can be a day or more for very large clusters. Force merge operations don't consume a great deal of processing power, but they can significantly degrade response times for queries touching the affected shards, and that can be unacceptable for some uses. Along with the time required to create and load data into a new cluster, this need for scheduled force merges is one of the more important determinants of cost and operational choices in the use of Elasticsearch.
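Scheduling decisions of this kind are usually driven by the deleted-document counts that Elasticsearch reports per index. A minimal sketch of the selection step, assuming stats shaped like the docs section of the stats API's output - the index names and the 20% threshold here are illustrative assumptions, not recommendations:

```python
# Sketch: pick force-merge candidates from per-index document stats.
# The stats shape mirrors the "docs": {"count": ..., "deleted": ...}
# section of Elasticsearch index stats; names and the threshold are
# illustrative assumptions.

def forcemerge_candidates(stats, max_deleted_ratio=0.20):
    """Return index names whose deleted-document ratio exceeds the threshold."""
    candidates = []
    for name, index in stats["indices"].items():
        docs = index["primaries"]["docs"]
        total = docs["count"] + docs["deleted"]
        if total and docs["deleted"] / total > max_deleted_ratio:
            candidates.append(name)
    return sorted(candidates)

sample_stats = {
    "indices": {
        "clients-a": {"primaries": {"docs": {"count": 800_000, "deleted": 350_000}}},
        "clients-b": {"primaries": {"docs": {"count": 900_000, "deleted": 50_000}}},
    }
}

print(forcemerge_candidates(sample_stats))  # ['clients-a']
```

In practice the candidate list would feed a maintenance window or failover plan rather than an immediate merge, for the response time reasons noted above.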
Shard Growth Due to Poor Assignment of New Data
A good piece of general advice when planning an Elasticsearch cluster's data topology - the split of data into indexes and the number of shards and replicas - is to keep shard size to a few gigabytes. This is as much an art as a science, however, as it has to be balanced against the types of request to be made and the nature of the data. Which data needs to be kept in the same index to ensure that requests don't hit too many different indexes, and thus become expensive? Still, it is definitely the case that, all other things being equal, large shards are slow.
For any significant data set, establishing a decent data topology isn't as easy as it might appear at first glance. Significant effort on process and tooling can be required to plan the data topology for optimal distribution between instances. Importantly, those tools cannot just account for the data as it stands today, but must also account for the addition of new data tomorrow.
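The core of that tooling is simple arithmetic applied to projected, not current, sizes. A sketch, in which the few-gigabyte shard target and the growth projection are assumptions for illustration:

```python
import math

# Sketch: derive a primary shard count from projected index size,
# following the "few gigabytes per shard" guidance above.
# TARGET_SHARD_GB and the growth horizon are illustrative assumptions.

TARGET_SHARD_GB = 5

def primary_shard_count(current_gb, monthly_growth_gb, months_ahead=12):
    """Size shards for the index as it will be, not as it is today."""
    projected_gb = current_gb + monthly_growth_gb * months_ahead
    return max(1, math.ceil(projected_gb / TARGET_SHARD_GB))

# A 40 GB index growing 5 GB/month is a 100 GB index in a year.
print(primary_shard_count(40, 5))  # 20
```

Since the primary shard count of an index can't be changed without reindexing, erring on the side of the projected figure is usually cheaper than splitting later.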
Consider, for example, a company that keeps data for its clients: if the business is successful, it will be adding new clients constantly. Their data must either be added to existing indexes, growing their shards, or given new indexes. How to decide between these? To pick a naive approach, if all new clients are simply thrown into a single index, its shards will eventually become too large and performance will suffer. Doing better than this requires the above-mentioned process and tooling, and again, even doing a passable job here can become quite complicated.
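One step up from the naive approach is a bin-packing heuristic: place each new client into the least-full index that still has room, and open a new index only when none does. A sketch, where the size cap and index names are illustrative assumptions, not Elasticsearch features:

```python
# Sketch of the kind of assignment logic the tooling needs. The cap,
# names, and in-memory size tracking are illustrative assumptions; a
# real system would read sizes from the cluster.

INDEX_CAP_GB = 100

def assign_client(index_sizes_gb, client_size_gb):
    """Return the index a new client should be written to, creating one if needed."""
    fits = [(size, name) for name, size in index_sizes_gb.items()
            if size + client_size_gb <= INDEX_CAP_GB]
    if fits:
        _, name = min(fits)  # least-full index with room
        index_sizes_gb[name] += client_size_gb
        return name
    name = f"clients-{len(index_sizes_gb):03d}"  # no room anywhere: new index
    index_sizes_gb[name] = client_size_gb
    return name

sizes = {"clients-000": 95, "clients-001": 60}
print(assign_client(sizes, 10))  # 'clients-001' has room; 'clients-000' does not
print(assign_client(sizes, 40))  # neither fits -> 'clients-002'
```

Even this toy version shows why the problem compounds: estimates of a client's future size are needed up front, and they are rarely accurate.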
Data Growth in General
A successful data-focused company will accumulate new data at some pace. Even if its Elasticsearch clusters are impeccably planned and the insertion of new data is carried out well, they will eventually run out of disk space and memory capacity. For a very successful data-focused company, "eventually" might be three months from now. Either a plan to expand or incrementally replace Elasticsearch clusters has to be in place to accommodate data growth, or the system architecture must be designed to keep the segment of overall data that must live in Elasticsearch at a constant size - for example, by only actively searching recent data.
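The recent-data-only approach is commonly implemented with time-based indexes: write to an index named for the current period, and search only the last few periods. A sketch, in which the naming pattern and three-month window are assumptions; Elasticsearch also offers rollover and lifecycle management features for this purpose:

```python
from datetime import date

# Sketch: keep the searchable data set constant-sized by writing to
# time-based indexes and querying only a recent window. The "events-"
# pattern and window length are illustrative assumptions.

def write_index(day):
    """Index name for new documents on a given day."""
    return f"events-{day:%Y.%m}"

def search_indices(today, window_months=3):
    """Index names covering the recent window, newest first."""
    names = []
    year, month = today.year, today.month
    for _ in range(window_months):
        names.append(f"events-{year:04d}.{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return names

print(write_index(date(2024, 6, 15)))    # events-2024.06
print(search_indices(date(2024, 2, 1)))  # ['events-2024.02', 'events-2024.01', 'events-2023.12']
```

Old indexes can then be dropped or archived wholesale, which is far cheaper than deleting individual documents.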
Cloud Instance Issues
Cloud instances from any provider have some chance of going bad per unit time, whether that is outright failure, poor I/O performance due to disk or network issues, gaining a problem neighbor in the hosting environment, or other more esoteric issues. The larger the Elasticsearch cluster, the more instances are involved, and the greater the odds that at least one of them is misbehaving at any given time. The goal is to have a cluster large enough to be minimally affected by any one bad instance, and monitoring comprehensive enough to see the signs of most classes of issue in their earlier stages.
The most common signs of cloud instance problems are abnormal load on one instance, abnormal I/O values for one instance, or - absent those telltales - a small fraction of requests with unusually high response times that can be traced back to access to shards on one specific instance. Instances can go bad in ways that have a subtle, persistent impact on Elasticsearch cluster metrics over days or weeks if not identified and replaced.
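The first of those telltales - one instance under abnormal load - lends itself to a simple automated check against the cluster median. A minimal sketch, assuming per-node load figures are already collected; the node names, values, and the 2x threshold are illustrative assumptions:

```python
import statistics

# Sketch: flag instances whose load deviates sharply from the cluster
# median, one of the telltales described above. Names, values, and the
# factor are illustrative assumptions.

def suspect_nodes(load_by_node, factor=2.0):
    """Return nodes whose load exceeds factor * cluster median."""
    median = statistics.median(load_by_node.values())
    return sorted(n for n, load in load_by_node.items() if load > factor * median)

loads = {"node-1": 2.1, "node-2": 1.9, "node-3": 2.3, "node-4": 6.8}
print(suspect_nodes(loads))  # ['node-4']
```

A check like this catches the loud failures; the subtle, persistent degradations noted above still require trending these metrics over days or weeks.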