Deep pagination’s slowness and OOM in Solr

Hi there,

Finally it's starting to feel like spring in the Northeast. Also, a friendly reminder: the tax return due date is 5/17/21.

So Solr has a built-in pagination feature with parameters like start (the offset of the first document) and rows (page size). It is intuitive and easy to use, but it doesn't come without caveats. Imagine what Solr is doing behind the scenes: it first sorts the matched documents in memory, then skips ahead to the document indicated by start, then fetches the next rows documents before returning them to the client. The key point is that Solr sorts and keeps documents in memory, at least up to the first start + rows of them.

Most use cases for Solr involve searching a small number of matching documents with custom queries and returning the top matches by score or sort. In real applications, people rarely care about anything past the first 10,000 documents. But when a client wants to do very "deep" pagination, e.g., fetching 1,000 documents starting from the 1,000,000th, Solr needs to sort and hold all 1,001,000 documents in memory. That is when the query takes a huge amount of time to run and often causes an OutOfMemoryError. This can bring the Solr service down if it is not protected by some safeguarding layer.
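
For reference, here is a minimal sketch of that naive deep-pagination request in SolrJ. It is only an illustration; the core URL is a placeholder and the client setup will vary with your deployment:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Placeholder URL; point this at your own core.
SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

SolrQuery query = new SolrQuery("*:*");
query.setStart(1000000); // forces Solr to collect and sort 1,001,000 docs in memory
query.setRows(1000);     // only these 1,000 are actually returned
QueryResponse response = solr.query(query);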

I did some tests on my local Solr core with 11+ million docs and 512 MB of JVM heap, varying the two parameters start and rows. The results are charted below:

[Chart: query time as a function of start and of rows]

Both start and rows have a tremendous impact on query time, independently, and they seem to have a similar rate of impact (I won't try to prove that here). More importantly, though not shown in the chart, my Solr hit an OOM when either start or rows rose above roughly 2,000,000. Deep pagination is therefore both dangerous (OOM) and impractical (slowness).

So how do we deal with this problem when a client wants deeply paginated documents, e.g., in my test case, anything after the 5,000,000th document? The solution Solr provides is cursorMark. Let's look at some code first.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

int rows = 1000000;
SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.setRows(rows);
// cursorMark requires a sort that includes the uniqueKey field (id here)
query.setSort("id", ORDER.asc);

// CURSOR_MARK_START is "*", meaning "start from the beginning"
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
int count = 0;
boolean noMore = false;
while (!noMore) {
    long start = System.currentTimeMillis();
    query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse response = solr.query(query); // solr is a SolrClient, as in the setup sketch above
    String nextCursorMark = response.getNextCursorMark();
    int pageSize = response.getResults().size();
    count += pageSize;
    // an unchanged cursor means there are no more documents
    if (cursorMark.equals(nextCursorMark)) noMore = true;
    cursorMark = nextCursorMark;
    System.out.println("Total " + count + ". This page has " + pageSize
            + " costs " + (System.currentTimeMillis() - start)
            + " ms. Next Cursor " + cursorMark);
}

Solr provides a cursorMark to help you progressively page deeper and deeper through your query's results. As long as you keep a consistent sort, every time a page of rows documents is returned, the response carries a nextCursorMark. This nextCursorMark must be sent back to Solr with the next query; Solr uses it to "calculate" which documents to skip, so the next batch of rows starts with the document right after them. In essence, you can think of the cursorMark as a filter-query-like parameter: Solr "filters" out all documents whose computed sort position is less than or equal to the cursor. It proceeds this way page by page, and when no more documents are left, Solr returns the same nextCursorMark again to let the client know it has reached the end.
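
To make the filter-query analogy concrete, here is a rough conceptual equivalent for a cursor over sort=id asc. This is only an illustration of the skipping logic (lastId is a hypothetical variable holding the last id returned in the previous page), not what Solr literally executes internally:

// Conceptual equivalent only: skip everything at or before the last seen id.
// cursorMark generalizes this idea to arbitrary (unique) sort values.
String lastId = "doc-0999999"; // hypothetical: last id from the previous page
SolrQuery next = new SolrQuery("*:*");
next.setRows(1000);
next.setSort("id", ORDER.asc);
next.addFilterQuery("id:{" + lastId + " TO *]"); // exclusive lower bound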

Output of the SolrJ code:

Total 1000000. This page has 1000000 costs 71220 ms. Next Cursor AoE/BTE3MmJiOWViLTg3Y2MtNGJjMy04YzMzLWNlYWY0MzIxZDk0Mw==
Total 2000000. This page has 1000000 costs 68629 ms. Next Cursor AoE/BTJlNGNhYmU4LTVjMjYtNGYwMy05MTUxLTA2YTNlNzk5ZGQ0Yw==
Total 3000000. This page has 1000000 costs 64734 ms. Next Cursor AoE/BTQ1NzBjYjIwLWJjNWUtNGU1MC04Mjc5LTg5YzM5Zjk2MTdmMg==
Total 4000000. This page has 1000000 costs 63366 ms. Next Cursor AoE/BTVjOTllZDExLWJmNTEtNGVjMS1hMDNkLTYwN2RkMTIzMTQ1YQ==
Total 5000000. This page has 1000000 costs 62166 ms. Next Cursor AoE/BTczYzY0MmRkLTgxYWItNGJjNy05YmFiLWYwNWZkNDgyODBhNw==
Total 6000000. This page has 1000000 costs 61337 ms. Next Cursor AoE/BThhZjEwY2UyLTI0YWYtNGNmZS04ZmUxLWFiMWFjMWQ2Zjk5Zg==
Total 7000000. This page has 1000000 costs 57545 ms. Next Cursor AoE/BWEyMTE0NTE1LTNjMjEtNGU0ZS05MjA2LTAyYTM5MTY3MjM1MA==
Total 8000000. This page has 1000000 costs 54268 ms. Next Cursor AoE/BWI5M2M1ODZkLWQwNGYtNDA2ZC1iYmRhLWQ5ZGE5NjRiY2E2Mw==
Total 9000000. This page has 1000000 costs 53344 ms. Next Cursor AoE/BWQwNzIxZTM5LTJkZGEtNGE4NS1hOWY5LTNmM2VlNmJlZGFhMA==
Total 10000000. This page has 1000000 costs 49601 ms. Next Cursor AoE/BWU3YTY1NDcxLTk5MWQtNDgyNi05Y2NlLWEyYWI1YTQ0NjhmZg==
Total 11000000. This page has 1000000 costs 40592 ms. Next Cursor AoE/BWZlY2MzY2JmLTJlMGQtNGEwYy1iODVkLTQ5NzZmYmFiMjA1MQ==
Total 11051795. This page has 51795 costs 2202 ms. Next Cursor AoE/BWZmZmZmZjk4LWUyZDgtNGU3Zi1hYTllLTM4NmViZTBiNWY4Ng==
Total 11051795. This page has 0 costs 318 ms. Next Cursor AoE/BWZmZmZmZjk4LWUyZDgtNGU3Zi1hYTllLTM4NmViZTBiNWY4Ng==

Each query for 1,000,000 rows is fairly consistent in query time, roughly 40 to 71 seconds. This is because each query is essentially an independent query. Also note that the last two queries return the same nextCursorMark, indicating the end of the documents.

This kind of cursor-controlled pagination saves query time, since each request only has to collect and sort the top rows documents past the cursor, rather than start + rows documents. More critically, there will be no OOM, unless the rows parameter itself is outrageously big.

One more requirement: the sort field has to be unique within the collection (in practice, the sort must include the uniqueKey field, at least as a tiebreaker). Why? If the sort field is not unique, there is some chance (depending on how often sort values are duplicated) that the cursorMark is calculated from a document whose sort value is shared by other documents, and the next query will not know which documents to skip. For example, if you paginate on first name and a common name like "tony" happens to be used to calculate the nextCursorMark, Solr cannot decide which "tony" to skip in the next query. Solr could certainly skip all "tony" documents, but then the result would silently lose documents that should have been returned.
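
If you do need to page over a non-unique field, the standard fix is to add the uniqueKey field as a tiebreaker in the sort. A minimal sketch, assuming first_name is an ordinary field and id is the collection's uniqueKey:

SolrQuery query = new SolrQuery("*:*");
query.setRows(1000);
query.addSort("first_name", ORDER.asc); // non-unique primary sort
query.addSort("id", ORDER.asc);         // uniqueKey tiebreaker makes each cursor position unambiguous
query.set(CursorMarkParams.CURSOR_MARK_PARAM, CursorMarkParams.CURSOR_MARK_START);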

Deep pagination is not that commonly needed in business use cases, in my opinion. But if you must, use cursorMark to speed up the query and, more importantly, to avoid bringing Solr to its knees with an OOM.

Cheers!

~T