High Offset Queries are Expensive in Elasticsearch
One of the causes of high load per query in Elasticsearch is queries that spread out across many indexes. Another is querying for even just a few documents at a very high offset. Below, for example, the first query is much more expensive than the second, and that is entirely a function of the offset into the data:
{ "size": 10, "from": 20000000, "query": { "bool": { "must": [ { "term": { "store": "example store" } } ] } } }
{ "size": 10, "from": 0, "query": { "bool": { "must": [ { "term": { "store": "example store" } } ] } } }
To serve the high-offset query, Elasticsearch must load and sort from + size documents - 20,000,010 for the first example above, versus just 10 for the second. A sizable number of high-offset queries in a large data set can bring even a reasonably sized Elasticsearch cluster to its knees. Naive architectural design often makes that possible: a young company rarely thinks far enough ahead to see the very large data of the future, and an application and its data layer are initially constructed only for the modest data of the startup period. End users are allowed to page through all of the data, and thus success brings growing pains. What to do about this?
Add More Servers
Adding more servers to the cluster will help with many problems, at least in the short term. It has the advantage of being quick, low-risk, and easy to accomplish. It has the disadvantage of being very expensive. A cluster that serves high-offset queries in the naive fashion might have to be twice as large as one that doesn't, and even then those queries will be slow to return results.
Limit the Offset
The best way forward, where it is possible, is to design an application architecture and a product that do not have to support high-offset queries. For example, limit requests to only a recent subset of the data - perhaps by moving old data to a system that works in a different way, offering only precalculated results for a few classes of query. Does every user really have to be able to access every last piece of ancient data in the archive? Is it vital that search engines crawl absolutely everything?
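Elasticsearch itself provides a guard of this sort: from version 2.1 onward, the index.max_result_window setting causes queries in which from + size exceeds its value - 10,000 by default - to be rejected outright. Rather than raising it to accommodate deep pagination, it can be left low and treated as the enforcement mechanism for an application-level cap. The setting is changed with a PUT to an index's _settings endpoint:

{
  "index": {
    "max_result_window": 10000
  }
}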
Planning in advance for what will happen when there are tens of millions of documents rather than a few million will save everyone a great deal of pain. Of course that planning rarely happens to a sufficient degree, and by the time it becomes apparent that major changes are needed to shift to a better architecture, the budget and the will required are absent. Smaller changes are made to cope instead.
Scroll
The scroll functionality is intended to solve the high-offset query issue for use cases in which the user is paginating because he or she wants a sizable subset of the sorted data from a query. Scroll can be used as a relatively cost-effective way of grabbing that data in bulk - so as to make it available for download, for example - rather than having a user hammer the Elasticsearch cluster with many high-offset queries.
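As a sketch of the flow, reusing the query from the earlier examples: the initial search is sent to the _search endpoint with a scroll duration parameter, such as ?scroll=1m, which holds a search context open for that length of time:

{
  "size": 1000,
  "query": {
    "bool": {
      "must": [
        { "term": { "store": "example store" } }
      ]
    }
  },
  "sort": [
    { "purchase_id": { "order": "asc" } }
  ]
}

The response contains a _scroll_id; each subsequent block of results is then retrieved by sending that identifier to the _search/scroll endpoint, repeating until a request returns no hits:

{
  "scroll": "1m",
  "scroll_id": "<the _scroll_id from the previous response>"
}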
The primary limitation here is that scroll is envisaged as an asynchronous operation, and is still costly enough that one wouldn't open a direct line between the front-end application and Elasticsearch to let people carry out scroll requests at will; some limits must be applied. Nonetheless, this can serve as a replacement function in many classes of application, and thus allow offsets into the data to be limited for all end users. It can't help all that much in situations where it is necessary to support synchronous high-offset queries from those end users.
Search After
The search after functionality was introduced in Elasticsearch 5. It works like a normal offset, but takes the sort field values of the last document seen rather than a numeric index into the result set. That makes walking through the results of a single sorted query far cheaper: each page picks up where the last left off, so while the first page costs as much as any normal query, every later page costs about the same small amount rather than growing ever more expensive.
{ "size": 10, "query": { "bool": { "must": [ { "term": { "store": "example store" } } ] } }, "search_after": [ 12000000 ], "sort": [ { "purchase_id": { "order" : "asc" } } ] }
There are limits and downsides. The offset in a search after query becomes a set of sort field values rather than a document index, which changes the logic of pagination significantly. It will still perform just as badly as a standard offset query for random high-offset pages from multiple different queries. Further, the sort order can change as a set of data is walked, page to page, if either (a) the underlying data is edited, or (b) the combination of sort fields does not uniquely identify each document. In the latter case, it is recommended to add the document _id as a tiebreaking sort field, but this makes it impossible to do anything other than paginate forwards in a linear fashion.
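As a sketch, the earlier query extended with an _id tiebreak might look as follows. The second search_after value is the _id of the last document on the previous page - the value shown here is illustrative, and depending on the Elasticsearch version a doc-values-enabled copy of the identifier (or _uid in older releases) may be required instead:

{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "term": { "store": "example store" } }
      ]
    }
  },
  "search_after": [ 12000000, "wXKd29ab" ],
  "sort": [
    { "purchase_id": { "order": "asc" } },
    { "_id": { "order": "asc" } }
  ]
}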
Cache Sort Order at the Application Level
An alternative to search after that retains pagination by document index is to cache the sort order of a query at the application level - either in application servers, in a shared memory cache, or similar. Whenever a high-offset query arrives, first load document IDs (either _id or some useful uniquely valued indexed field) for the entire data set, given the desired sort, in order to cache the ordering of documents. This can be accomplished in blocks, and might use either search_after or old-style pagination, as shown below:
{ "size": 10000, "from": 0, "fields": [ "purchase_id" ], "query": { "bool": { "must": { "term": { "store": "example store" } } } }, "sort": [ { "purchase_id": { "order" : "asc" } } ] }
Given a cached sort order, any later request for a small number of documents at a given offset can then be served quickly by looking up the relevant document IDs in the cache and requesting them directly:
{ "query": { "bool": { "must": [ { "ids": { "values": [ "301", "302", ... ] } } ] } }, "sort": [ { "purchase_id": { "order" : "asc" } } ] }
The advantage of this approach is that it provides greater control over the consistency of document ordering, the length of time a cached ordering is retained, and the degree to which load can be taken off the Elasticsearch cluster. Loading sort orders can be carried out asynchronously if desired, outside the usual flow of an end-user request.
Disadvantages include the potential size of the memory footprint and the management thereof; for significant use in any application, where many end users are making many varied high-offset requests, this would certainly require a shared cache service rather than putting the burden on individual application servers.