
I use bulk update with a script in order to update a nested field, but this is very slow:

POST index/type/_bulk

{"update":{"_id":"1"}}
{"script"{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"1","field2":"2"}}}}
{"update":{"_id":"2"}}
{"script"{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"3","field2":"4"}}}}

... [a lot more, split into several batches]

Do you know another way that could be faster?

It seems possible to store the script so that it doesn't have to be repeated for each update, but I couldn't find a way to keep the params "dynamic".
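To illustrate what I mean by storing the script, here is a rough sketch of a bulk update action that references a stored script while still passing per-document params (this assumes the 5.x `stored` script reference and a hypothetical script id `add_nested`; the exact stored-script syntax differs between 5.x minor versions):

```
{"update":{"_id":"1"}}
{"script":{"stored":"add_nested","params":{"nestedfield":{"field1":"1","field2":"2"}}}}
```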

eli0tt
  • I'm afraid there is no workaround for this. – Hatim Stovewala Oct 18 '17 at 15:43
  • @eli0tt Could you please clarify a couple of questions. Do you insert new arbitrary data every time (when updating)? How fast is the process currently going? What is (roughly) the size of your index: number of records, size on disk? How often do you do such updates (of millions of nested docs)? Can you use [`_reindex`](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/docs-reindex.html)? Thanks! – Nikolay Vasiliev Oct 19 '17 at 15:53
  • My index has ~11M documents (3.6G). I need to do a one-shot big update which consists of adding records to the same nested field in existing documents. The data is arbitrary. Some documents don't need to be updated; some need to be updated with ~1-3 new records in the nested field. The total number of updates needed is ~12M. I generated files in the bulk format from the data, as you can see in my post. The execution of the updates seems to take days... (even with a stored script, which I also tried). Thanks! – eli0tt Oct 20 '17 at 09:00

1 Answer


As is often the case with performance optimization questions, there is no single answer, since there are many possible causes of poor performance.

In your case you are making bulk update requests. When an update is performed, the document is actually re-indexed:

> ... to update a document is to retrieve it, change it, and then reindex the whole document.

Hence it makes sense to take a look at indexing performance tuning tips. The first few things I would consider in your case are selecting the right bulk size, using several threads for bulk requests, and increasing or disabling the index refresh interval.
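For instance, the refresh interval can be disabled for the duration of the bulk load via the index settings API (a sketch; `index` stands for your actual index name):

```
PUT index/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}
```

Once the bulk updates are done, set `refresh_interval` back to its previous value (the default is `1s`), otherwise the updated data will not become visible to searches.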

You might also consider using a ready-made client that supports parallel bulk requests, like the Python Elasticsearch client does.
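As a rough sketch (assuming the elasticsearch-py 5.x client; `my_updates` is a placeholder for however you iterate over your source data), the `parallel_bulk` helper lets you send the same scripted updates from several threads:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch()

# Placeholder data taken from the example in the question.
my_updates = [
    ("1", {"field1": "1", "field2": "2"}),
    ("2", {"field1": "3", "field2": "4"}),
]

def actions():
    # One update action per document, using the same script as in the bulk file.
    for doc_id, nested in my_updates:
        yield {
            "_op_type": "update",
            "_index": "index",
            "_type": "type",
            "_id": doc_id,
            "script": {
                "inline": "ctx._source.nestedfield.add(params.nestedfield)",
                "params": {"nestedfield": nested},
            },
        }

# parallel_bulk returns a lazy generator, so it has to be consumed.
for ok, item in parallel_bulk(es, actions(), thread_count=4, chunk_size=1000):
    if not ok:
        print(item)
```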

It would be ideal to monitor Elasticsearch performance metrics to understand where the bottleneck is and whether your performance tweaks are giving an actual gain. Here is an overview blog post about Elasticsearch performance metrics.

Nikolay Vasiliev