When you build a webpage that displays data stored in Elasticsearch, there are a few extra things to consider. An index often holds far more information than the API Gateway can handle in a single response.
That’s why it’s essential to paginate results so the client can get a predictable and manageable data amount returned every time.
However, before you can start paginating the results on the client, you must paginate the data in backend storage. Most data storage solutions have functions that allow users to paginate, filter, and sort data.
Before we get to search requests, search engine configurations, and size parameters, let’s step back and cover a few simpler concepts. Then we can look at Elasticsearch pagination and how it works.
What is Elasticsearch?
Elasticsearch is a distributed search and analytics engine. It’s currently one of the most popular search engines used for operational intelligence, text search, business analytics, log analytics, security intelligence, etc. Elasticsearch lets users send data as JSON documents using the API or various ingestion tools.
Elasticsearch stores original documents automatically and adds searchable references for them within the cluster’s index. Users can also use the Elasticsearch API to search and retrieve desired documents.
You can also use various visualization tools for building interactive dashboards and visualizing data.
Why use Elasticsearch pagination
When you build a webpage that needs to display a large amount of data stored in Elasticsearch, the index often holds more information than the API Gateway can handle at once. The best solution in this scenario is to paginate results so the client gets a predictable amount of data returned every time.
However, before you can paginate your results on the client, you will have to paginate the data in backend storage. Luckily, most data storage platforms, including Elasticsearch, have various functions allowing users to paginate, filter, and sort data.
Your data structure and requirements are vital to determining what paginating methods you should use. Today, we’ll take a look at multiple pagination methods and explain how they work.
What is Elasticsearch pagination?
Pagination is a well-known technique for presenting results on the web. In Elasticsearch, it is the default mechanism for fetching larger result sets. When you send a query to Elasticsearch without specifying otherwise, it returns only the first ten documents, which are the ten most relevant ones.
Limiting the presentation to around five pages is generally a good idea. That helps users know which page they are on, go to the next or the previous page, and select a specific page they want to navigate to. There are three general pagination types you should know:
- Cursor-based
- Keyset
- Offset
However, since our topic today is Elasticsearch pagination, we shall focus on these three types:
- Scroll pagination
- Search-after pagination
- Traditional pagination
In most applications, ten hits are displayed on the initial page, and users have several ways to see more: a button for the next or previous page, a list of page numbers they can jump to, or a scrolling option. Now, let’s take a look at each pagination method individually.
Elasticsearch typical pagination
As mentioned earlier, traditional pagination is the default mechanism for fetching more results in Elasticsearch. When sending search queries to Elasticsearch, it uses the default values to send the first or most relevant documents with a maximum of 10 results.
By default, this pagination method cannot reach beyond the first 10,000 documents: a request is rejected when its offset (from) plus its page size (size) exceeds that limit. The limit can be raised through the index.max_result_window index setting. Many developers do this, but it could cause issues down the road.
A search request consists of two phases. The first is the query phase, and the second is the fetch phase.
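As a rough illustration of how a page number maps onto such a request, the sketch below builds a search body with an offset and a page size. The helper name `build_page_body` and the constant `MAX_RESULT_WINDOW` are assumptions made for this example; only the `from` and `size` parameters and the 10,000 default come from Elasticsearch itself.

```python
from typing import Any

# Default value of the index.max_result_window index setting.
MAX_RESULT_WINDOW = 10_000


def build_page_body(query: dict[str, Any], page: int, size: int = 10) -> dict[str, Any]:
    """Build a from/size search body for a 1-based page number (hypothetical helper)."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    # Elasticsearch rejects requests where from + size exceeds the window,
    # so fail early instead of sending a request that cannot succeed.
    if offset + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds index.max_result_window")
    return {"query": query, "from": offset, "size": size}


# Third page of ten results: skip the first 20 hits.
body = build_page_body({"match_all": {}}, page=3)
```

With ten hits per page, page 1,000 is the last one this helper accepts. Anything deeper would need either a larger window or one of the other methods described in this post.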
The query phase (first phase)
During the query phase, each data node matches documents and calculates their scores, then returns a list of document IDs together with those scores.
The list is built on the data node and subsequently forwarded to the node in charge of the search request, where it’s sorted and retained in memory. The query score-ID list can become really large over time.
The fetch phase (second phase)
During the second phase, the JSON source of each document is fetched from the nodes that hold it. In effect, this is a multi-get request by ID for all the documents on the page that needs to be returned.
Even though these requests are efficient, all of the query-related information must be kept in memory until the response is sent to the client, and that’s why it’s generally a good idea to use smaller page sizes.
This method can easily lead to deep pagination, which consumes a lot of cluster memory and degrades performance.
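To get a feel for why deep pages are expensive, consider a back-of-the-envelope sketch. It assumes each shard has to return its own top from + size candidates, which the node in charge of the request then merges. The function name `entries_to_merge` and the shard count of 5 are illustrative assumptions, not values from any real cluster.

```python
def entries_to_merge(offset: int, size: int, shards: int) -> int:
    """Estimate the score-ID entries the coordinating node sorts for one request."""
    # Every shard returns its top (offset + size) candidates, because any of
    # them could end up on the requested page after the global sort.
    return shards * (offset + size)


first_page = entries_to_merge(offset=0, size=10, shards=5)
deep_page = entries_to_merge(offset=9_990, size=10, shards=5)
print(first_page, deep_page)
```

Under these assumptions, the first page needs 50 entries while the page ending at hit 10,000 needs 50,000, even though both return only ten documents to the client.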
Deep pagination and why to avoid it
Deep pagination leads to extensive memory usage, causes cluster latency, and disrupts the performance of your cluster overall. With this approach, you’re allowing access to all pages. That might sound good in theory, but even Google has a limited number of pages displayed.
For every request, Elasticsearch recalculates the hits, then sorts the whole score-ID list and holds it in memory. Instead of offering ever deeper pages, the focus should be on providing relevant scoring, filters, and UI to make your users happy with the results they get on the first page.
In other words, a single request and page result should be enough for most people.
Search_after pagination
The search_after pagination is ideal for applications where you don’t need to jump from specific pages to other pages or when you use infinite scrolling. The search_after pagination lets you tell Elasticsearch which was the last hit viewed so that it can ignore all previous hits.
Rather than keeping the entire score-ID list for the request in memory and sorting it to find the right page, this method uses the sort values of the last hit from the previous search request, including a tiebreaker field, to determine where the next page starts.
With this search method, you can show many hits efficiently. With search_after, it’s possible to show over 10,000 hits without worrying about memory usage or using pre-calculated pages.
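A minimal sketch of the idea, under stated assumptions: the request for the next page copies the sort values of the last hit into the `search_after` parameter. The helper name `build_search_after_body` and the field names `published_at` and `id` are hypothetical. The `sort` array on each hit is what Elasticsearch returns when a search specifies a sort.

```python
from typing import Any, Optional


def build_search_after_body(
    query: dict[str, Any],
    sort: list[dict[str, str]],
    size: int = 10,
    last_hit: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build the body for the first page, or for the page after last_hit."""
    body: dict[str, Any] = {"query": query, "sort": sort, "size": size}
    if last_hit is not None:
        # Resume right after the last hit the user has already seen.
        body["search_after"] = last_hit["sort"]
    return body


# Hypothetical sort: newest first, with a unique field as the tiebreaker.
sort = [{"published_at": "desc"}, {"id": "asc"}]
last_hit = {"_id": "42", "sort": ["2023-05-01T10:00:00Z", "42"]}
next_body = build_search_after_body({"match_all": {}}, sort, last_hit=last_hit)
```

The tiebreaker matters here: if two documents share the same `published_at` value, a unique second sort field keeps the position of the next page unambiguous.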
Live index updates and pagination
Elasticsearch supports live index updates without causing performance issues. You can add, delete, or update documents while queries are running against the same index.
Even though this capability is really useful, it can lead to inconsistencies between search result pages when pagination is involved.
For example, if a document relevant to the query is inserted while the user is on the first page and the user then clicks through to the second page, they will likely see the last document from the first page again at the top. Search_after and traditional pagination are stateless, meaning there’s no guarantee the order of the search results will be the same when users change pages.
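The effect is easy to reproduce in a small in-memory simulation. This is not Elasticsearch code: the documents, their scores, and the helper `fetch_page` are all made up to mimic offset-based paging over a list that changes between two requests.

```python
def fetch_page(docs: list[dict], offset: int, size: int) -> list[str]:
    """Rank docs by score (highest first) and return the IDs on one page."""
    ranked = sorted(docs, key=lambda doc: doc["score"], reverse=True)
    return [doc["id"] for doc in ranked[offset:offset + size]]


docs = [
    {"id": "a", "score": 9.0},
    {"id": "b", "score": 8.0},
    {"id": "c", "score": 7.0},
    {"id": "d", "score": 6.0},
]

first_page = fetch_page(docs, offset=0, size=2)

# A highly relevant document is indexed before the user asks for page two.
docs.append({"id": "new", "score": 9.5})

second_page = fetch_page(docs, offset=2, size=2)
print(first_page, second_page)  # ['a', 'b'] ['b', 'c']
```

Document "b" closes the first page and then opens the second one, because the new document pushed every earlier hit down by one position.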
You need stateful pagination to keep the search experience consistent across repeated requests. That brings us to the next method, the Point in Time API.
Using the Point in Time API with pagination or search_after
You can use the Point in Time API to extend search_after or traditional pagination and make them stateful. For as long as the point in time is kept alive, users search against the same version of the index.
Updates made after the point in time was opened are not visible to these searches, so the search experience stays consistent. Users won’t see random documents showing up when navigating back and forth across pages.
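As a hedged sketch, a search that runs against a point in time carries the point-in-time ID in its body instead of naming an index. The helper `build_pit_search_body`, the placeholder ID, and the `published_at` field are assumptions for illustration. In practice the ID comes from opening a point in time on the index first.

```python
from typing import Any, Optional


def build_pit_search_body(
    pit_id: str,
    query: dict[str, Any],
    sort: list[dict[str, str]],
    size: int = 10,
    keep_alive: str = "1m",
    search_after: Optional[list[Any]] = None,
) -> dict[str, Any]:
    """Build a search body bound to an already opened point in time."""
    body: dict[str, Any] = {
        "query": query,
        "sort": sort,
        "size": size,
        # keep_alive extends the lifetime of the point in time on each request.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    if search_after is not None:
        # Combine with search_after to walk pages over a frozen view.
        body["search_after"] = search_after
    return body


# "example-pit-id" is a placeholder, not a real point-in-time ID.
pit_body = build_pit_search_body(
    "example-pit-id",
    {"match_all": {}},
    [{"published_at": "desc"}],
    search_after=["2023-05-01T10:00:00Z"],
)
```

Keeping `keep_alive` short is a deliberate choice in this sketch: the point in time only needs to survive until the next page request, and an open one holds cluster resources.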
Scroll pagination and scroll API
You can use the Scroll API to iterate over many documents that match a query, and sometimes even over all matching documents. Even though this API has the name Scroll API, you should never use it for implementing infinite scrolling. It also shouldn’t be used for frequent end-user requests.
The Scroll API is completely stateful, meaning all index updates are ignored during the scroll request. To achieve this, Elasticsearch must store a snapshot of the current index version and keep it alive for the scroll’s lifespan.
On actively updated indexes, keeping the initial search context alive can be costly. The Scroll API is best suited to retrieving a large collection of documents from a single search request.
You will need a scroll_id for the Scroll API, and you get it by adding the scroll parameter to the initial search request.
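The loop below sketches how a scroll is typically consumed: start the search, then keep asking for the next batch with the most recent scroll ID until a batch comes back empty. To keep the example runnable without a cluster, the two callables are stand-ins backed by an in-memory list. `iterate_scroll`, `fake_start`, and `fake_next` are hypothetical names. Only the response shape (`_scroll_id` and `hits.hits`) follows Elasticsearch.

```python
from typing import Any, Callable, Iterator

Response = dict[str, Any]


def iterate_scroll(
    start_search: Callable[[], Response],
    next_batch: Callable[[str], Response],
) -> Iterator[dict[str, Any]]:
    """Yield every hit of a scroll, batch by batch, until a batch is empty."""
    response = start_search()
    while response["hits"]["hits"]:
        yield from response["hits"]["hits"]
        # Always continue with the scroll ID from the latest response.
        response = next_batch(response["_scroll_id"])


# In-memory stand-ins for the initial search and the follow-up scroll calls.
batches = [[{"_id": "1"}, {"_id": "2"}], [{"_id": "3"}], []]


def fake_start() -> Response:
    return {"_scroll_id": "s0", "hits": {"hits": batches[0]}}


def fake_next(scroll_id: str) -> Response:
    index = int(scroll_id[1:]) + 1
    return {"_scroll_id": f"s{index}", "hits": {"hits": batches[index]}}


ids = [hit["_id"] for hit in iterate_scroll(fake_start, fake_next)]
print(ids)  # ['1', '2', '3']
```

In a real setup, the two callables would wrap the search request with the scroll parameter and the follow-up scroll requests. It is also worth clearing the scroll once iteration finishes so that the search context is released.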
Conclusion
To sum things up in a simple manner, you should use traditional pagination whenever you need to access pages freely, and you have no need for deep pagination.
The search_after method is the best solution when users move through results sequentially, for example with a “next” button or infinite scrolling, and need to reach beyond the first 10,000 hits.
If you need consistency across search result pages, it’s best to combine either method with the Point in Time API. The Scroll method is ideal when you want to list all query hits in a consistent order, for example in a batch job rather than in response to end-user requests.
We hope our post has helped you understand Elasticsearch pagination and how to choose the best method for your needs.
Barbora does magic with words in Luigi's Box as a product marketing specialist. She got into writing while studying at university as a volunteer for various civic associations. Besides being part of Luigi's Box marketing team, she co-organizes the TEDxBratislava conference, where she cares about marketing and PR.