I am using SolrJ to do partial (atomic) updates of documents (see: solrj api for partial document update).
In my use case, I am trying to implement a reusable "update by query" library, so the entry point is updateByQuery(SolrQuery query, Map fieldsToUpdate).
The algorithm is to run the Solr query to obtain the id field of each matching document, then build up batches of these update documents:
[
{"id":"myidhere", "my_update_field_b":{"set": false}},
{"id":"myidhere2", "my_update_field_b":{"set": true}},
...
]
So it stores up batches of N of these SolrInputDocuments and submits them when the batch fills up, using solrClient.add(Collection<SolrInputDocument>).
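For reference, the batching logic described above can be sketched like this. This is a minimal sketch, not my actual library code: the batch size and field names are illustrative, and I am using plain Maps to stand in for SolrInputDocument (which has the same {"id": ..., field: {"set": value}} shape); the comments mark where the real solrClient.add(...) call would go.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UpdateByQueryBatcher {

    // Build one atomic-update document: {"id": ..., field: {"set": value}}.
    // In real SolrJ this would be a SolrInputDocument with the same structure,
    // where the inner Map ("set" -> value) marks the field as an atomic update.
    static Map<String, Object> atomicUpdateDoc(String id, String field, Object value) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        Map<String, Object> op = new LinkedHashMap<>();
        op.put("set", value);
        doc.put(field, op);
        return doc;
    }

    // Group the update documents into batches of batchSize. In the real code,
    // each completed batch is submitted with
    // solrClient.add(Collection<SolrInputDocument>) as soon as it fills up.
    static List<List<Map<String, Object>>> toBatches(List<Map<String, Object>> docs,
                                                     int batchSize) {
        List<List<Map<String, Object>>> batches = new ArrayList<>();
        List<Map<String, Object>> current = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            current.add(doc);
            if (current.size() == batchSize) {
                batches.add(current);            // here: solrClient.add(current)
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);                // flush the final partial batch
        }
        return batches;
    }
}
```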
The performance of this seems to be several times slower than if I simply re-inserted the entire document, because the documents are quite small to begin with.
Is there some way to use Solr streaming expressions or Spark Solr to do this in an optimal and distributed fashion?
I feel like there is likely some way to leverage parallelism of the solr cluster to do this faster.
Anyone know how to do it better?