
I have several million history objects that I need to save to Elasticsearch. What would be the best way to do this, without going into the internals of Elasticsearch? Here is the pattern I'm currently using:

ACTIONS = []
NUM_ACTIONS_TO_BULK = 10000
for num, item in enumerate(HISTORY_DATA.values()):
    ACTIONS.append({
        "_index": ES_INDEX_NAME,
        "_type": "_doc",
        "_id": item.pop('_id'),
        "_source": item
    })

    # Save every 10k and again at the end
    if (len(ACTIONS) == NUM_ACTIONS_TO_BULK) or (num == len(HISTORY_DATA) - 1):
        log.info('%s/%s - Saving %s items to ES...' % (num, len(HISTORY_DATA), len(ACTIONS)))
        _ = helpers.bulk(self.es, ACTIONS)
        ACTIONS = []

The above saves it to ES in batches of 10k. Is this the best/most efficient way to save things to ES? For example, what if I tried saving all 15M objects directly to ES using helpers.bulk -- does that chunk the items, or does it try saving it all at once? Does it look like I'm missing anything in the above?

David542

1 Answer


A couple of things to try when using the Bulk API: play around with the number of docs you send per request, since the optimal batch size depends on the average document size. There are some good suggested starting points listed here. Sometimes 100 at a time will be faster than 1000 at a time. Also, if you can make your application multithreaded, do it, and if you have multiple nodes that can accept writes, take advantage of that.
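For reference, the elasticsearch-py helpers already expose both of these knobs: helpers.bulk() splits whatever iterable you pass it into chunks of chunk_size documents (so it does not send all 15M in one request), and helpers.parallel_bulk() does the same from a thread pool. Below is a rough sketch, not a drop-in replacement: it assumes the HISTORY_DATA dict and ES_INDEX_NAME from the question, a client called es, and chunk_size / thread_count values that are only starting points to experiment with.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # assumed connection settings

def generate_actions():
    # Yield actions lazily so the full action list never sits in memory;
    # the helpers accept any iterable and do the chunking themselves.
    for item in HISTORY_DATA.values():
        yield {
            "_index": ES_INDEX_NAME,
            "_id": item.pop("_id"),
            "_source": item,
        }

# Single-threaded, with an explicit chunk size:
# helpers.bulk(es, generate_actions(), chunk_size=1000)

# Multithreaded: parallel_bulk() indexes chunks from a thread pool. It
# returns a lazy generator, so it must be iterated for anything to happen.
for ok, info in helpers.parallel_bulk(
        es, generate_actions(), thread_count=4, chunk_size=1000):
    if not ok:
        print("Failed to index a document: %s" % info)

On Elasticsearch 6.x you may still need the "_type": "_doc" field on each action, as in the question; on newer versions it can be dropped.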

Tim