
I am processing an entire Solr index of 80 million documents, and I am doing so through pagination.

I learned from here that it is a bad idea to use the start parameter for pagination on a very large index like this; instead, I should use a cursor mark, with code like the below:

SolrQuery q = new SolrQuery("*:*");   // the query to page through
q.setRows(5000);                      // page size
q.setSort("id", SolrQuery.ORDER.asc); // cursors require a sort on the uniqueKey field
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solrServer.query(q);
  String nextCursorMark = rsp.getNextCursorMark();
  boolean hadEnough = doCustomProcessingOfResults(rsp);
  if (hadEnough || cursorMark.equals(nextCursorMark)) {
    done = true;
  }
  cursorMark = nextCursorMark;
}

However, this requires the query to first sort the entire index on the uniqueKey field, which is defined as:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

This sort requires a lot of memory, and my computer does not have enough to deal with it: the query fails with an OutOfMemoryError.

I wonder if there is any workaround for this? Many thanks in advance.

Ziqi
  • Make sure you've [enabled `docValues` for your field](https://lucene.apache.org/solr/guide/6_6/docvalues.html) (see the schema sketch after these comments). It'll require reindexing the content, but this greatly optimizes sorting of large result sets. How many rows are you on average fetching before the search terminates? (i.e. would it be useful to export the whole result set instead?) – MatsLindh Jun 18 '19 at 19:32
  • Thanks for your reply. I did not enable this, and it would be very costly to re-index the data as it took 2 months. No rows were fetched at all before it terminated, with a rows=5000 param. I am currently going back to using the 'start' parameter. My task is to export the entire index, so I am paging on the query '*:*', relying on the fact that the results are always returned in the order of the indexing timestamp. – Ziqi Jun 19 '19 at 06:50
  • If you want to export the complete result set, use the [/export request handler - Exporting result sets](https://lucene.apache.org/solr/guide/6_6/exporting-result-sets.html) - but that might require docValues again. That endpoint will stream the result to you and would by far have the best throughput for exporting results. However, another possibility is that the out of memory error only means that your JVM (the Java virtual machine) requires more memory - not that your computer doesn't have enough. You can configure this in the script that starts Solr. Both options are sketched after these comments. – MatsLindh Jun 19 '19 at 07:11
  • Indeed I still need docValues for the export request handler. But it appears that I really have no option other than re-indexing... incrementing the 'start' parameter isn't scalable, as I am now at start=40 million and it also hits the memory issue. Painful lesson learned... – Ziqi Jun 20 '19 at 12:24
  • A further thought is using Lucene directly, going through the entire index using the method described at https://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index - would that be scalable on a large index? – Ziqi Jun 20 '19 at 12:38
  • Sure, working directly with the index files is an option. It'd be just as effective as what Solr would have to do internally, without the overhead added by Solr. Create the indexreader and loop through all documents. – MatsLindh Jun 20 '19 at 12:46
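As a concrete illustration of the docValues suggestion in the comments, the fix is a one-attribute change to the field definition from the question (a sketch; as noted above, it only takes effect after reindexing):

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" docValues="true" />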
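For reference, the two other suggestions from the comments, sketched as commands (the collection name and heap size are placeholders, not values from this question):

# stream the whole result set through the /export handler
# (requires docValues on every field used in sort and fl)
curl "http://localhost:8983/solr/mycollection/export?q=*:*&sort=id+asc&fl=id"

# give the JVM a larger heap when starting Solr, in case the
# OutOfMemoryError is a JVM limit rather than a hardware one
bin/solr start -m 8g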

1 Answer


Just an update, and many thanks for the input from MatsLindh.

If you face the same problem, i.e.: (1) you want to export an entire index somewhere for some reason; (2) your index is very large, e.g., tens of millions of records; (3) you did not index with 'docValues' enabled on the relevant fields, which means you cannot use the more efficient cursor mark or the /export handler; and (4) you don't have enough memory to page through the index with Solr's 'start' and 'rows' parameters, then the following may help.

The solution is to use the Lucene IndexReader directly, bypassing Solr, and I can report a speed improvement of orders of magnitude: it took just 3 hours to export 90 million records this way, whereas with Solr and the 'start' and 'rows' parameters it took more than 24 hours to export just 16 million.
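The gist of the approach is below, a minimal sketch assuming Lucene 6.x to match the Solr 6.6 docs linked in the comments (the index path, class name, and output logic are placeholders, not values from my setup):

import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

public class DumpIndex {
  public static void main(String[] args) throws Exception {
    // path to the core's index directory - adjust to your setup
    try (IndexReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/path/to/solr/mycore/data/index")))) {
      Bits liveDocs = MultiFields.getLiveDocs(reader); // null if the index has no deletions
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (liveDocs != null && !liveDocs.get(i)) {
          continue; // skip deleted documents
        }
        Document doc = reader.document(i); // loads stored fields only
        String id = doc.get("id");
        // ... write the document out here (CSV, JSON, another index, etc.)
      }
    }
  }
}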

Ziqi