
I am using SolrJ to do partial (atomic) updates of documents.

See: SolrJ API for partial document update

In my use case, I am trying to implement a reusable "update by query" library.

The entry point is updateByQuery(SolrQuery query, Map<String, Object> fieldsToUpdate), where the map holds field names and their new values.

The algorithm uses the Solr query to obtain the id field of each matching document, then builds up batches of update documents like these:

[
  {"id":"myidhere", "my_update_field_b":{"set": false}},
  {"id":"myidhere2", "my_update_field_b":{"set": true}},
...
]

These SolrInputDocuments are accumulated into batches of N and submitted with solrClient.add(Collection&lt;SolrInputDocument&gt;) whenever a batch fills up.
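
Roughly, my implementation looks like the sketch below. This is simplified: the batch size is arbitrary, cursor-based deep paging is just one way to walk the result set, and commit policy is left to the caller.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class UpdateByQuery {

    private static final int BATCH_SIZE = 500; // arbitrary

    public void updateByQuery(SolrClient solrClient, SolrQuery query,
                              Map<String, Object> fieldsToUpdate)
            throws SolrServerException, IOException {
        // Only fetch the id field; cursor paging requires a sort on the unique key.
        query.setFields("id");
        query.setRows(BATCH_SIZE);
        query.setSort(SolrQuery.SortClause.asc("id"));

        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = solrClient.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                // One atomic-update document per match: the id plus a
                // {"set": value} map for each field to update.
                SolrInputDocument update = new SolrInputDocument();
                update.addField("id", doc.getFieldValue("id"));
                for (Map.Entry<String, Object> e : fieldsToUpdate.entrySet()) {
                    update.addField(e.getKey(), Collections.singletonMap("set", e.getValue()));
                }
                batch.add(update);
                if (batch.size() >= BATCH_SIZE) {
                    solrClient.add(batch);
                    batch.clear();
                }
            }
            String nextCursorMark = rsp.getNextCursorMark();
            done = cursorMark.equals(nextCursorMark);
            cursorMark = nextCursorMark;
        }
        if (!batch.isEmpty()) {
            solrClient.add(batch);
        }
        // Committing (or relying on autoCommit) is left to the caller.
    }
}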

The performance of this is several times slower than simply re-inserting the entire document, presumably because the documents are quite small to begin with, so a full re-index is cheap.

Is there some way to use Solr streaming expressions or Spark Solr to do this in an optimal and distributed fashion?

I feel like there is likely some way to leverage the parallelism of the Solr cluster to do this faster.
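
For example, would something like SolrJ's ConcurrentUpdateSolrClient be the right tool? As I understand it, it buffers documents in an internal queue and drains that queue to Solr from multiple background threads. A sketch of what I mean (the URL, queue size, and thread count are placeholders):

import java.util.Collections;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelPartialUpdates {
    public static void main(String[] args) throws Exception {
        // add() returns quickly; background threads pipeline the queued updates.
        try (ConcurrentUpdateSolrClient client =
                 new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycollection")
                     .withQueueSize(10000)
                     .withThreadCount(8)
                     .build()) {
            SolrInputDocument update = new SolrInputDocument();
            update.addField("id", "myidhere");
            // The {"set": ...} map marks this as an atomic update of one field.
            update.addField("my_update_field_b", Collections.singletonMap("set", false));
            client.add(update);

            client.blockUntilFinished(); // drain the queue before committing
            client.commit();
        }
    }
}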

Anyone know how to do it better?

  • The reason for this being slower is that internally Solr has to do exactly what you're already doing: fetch the document, change the value of the single field, and then re-index the document. Since you've already retrieved the document and changed the value yourself, Solr can skip two of the three required operations. You can submit documents from many threads at the same time and let the different nodes handle them, and you can make sure you use in-place updates where possible. You can also use an external file field if your use case fits its limitations. – MatsLindh May 07 '22 at 13:59
  • Yeah, parallelism made the partial updates much faster. For whatever reason, inserting with a single thread is a lot slower than partial updating with a single thread. – Nicholas DiPiazza May 10 '22 at 20:12
