
I am trying to index a CSV file with 6M records into Elasticsearch using the Python pyes module. The code reads each record line by line and pushes it to Elasticsearch... any idea how I can send these as bulk requests?

import csv
import sys
from pyes import ES

# the CSV file path is passed on the command line
reader = csv.reader(open(sys.argv[1]))

header = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']

conn = ES('xx.xx.xx.xx:9200')

counter = 0
for row in reader:
    if counter == 0:
        pass  # skip the header row
    else:
        # build one document per row, mapping header names to column values
        data = dict(zip(header, (str(value) for value in row)))
        print data
        print counter
        conn.index(data, 'accidents-index', 'accidents-type', counter)
    counter += 1
krisdigitx
  • Similar question over here http://stackoverflow.com/questions/9002982/elasticsearch-bulk-index-in-chunks-using-pyes?rq=1 – Aidan Kane Oct 09 '13 at 14:55
  • Based on my investigation, sending 6M records in bulk is not going to be efficient... – krisdigitx Oct 11 '13 at 10:56
  • It's better to use a message queuing server... – krisdigitx Feb 27 '15 at 23:45
  • http://stackoverflow.com/questions/20288770/how-to-use-bulk-api-to-store-the-keywords-in-es-by-using-python works. It doesn't use "pyes" but the more robust "elasticsearch" library http://elasticsearch-py.readthedocs.io/en/master/index.html (a sketch of that approach follows these comments). – Glen Thompson Dec 15 '16 at 00:09
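For reference, here is a minimal, hypothetical sketch of the approach from the last comment, using the elasticsearch-py library's helpers.bulk. The host, the "accidents.csv" file name, and the index/type names are placeholders carried over from the question, not anything prescribed by the library.

from elasticsearch import Elasticsearch, helpers
import csv

es = Elasticsearch(['http://xx.xx.xx.xx:9200'])

def generate_actions(path):
    # stream one bulk action per CSV row instead of building a 6M-item list
    with open(path) as f:
        reader = csv.DictReader(f)   # field names come from the CSV header row
        for i, row in enumerate(reader, 1):
            yield {
                "_index": "accidents-index",
                "_type": "accidents-type",   # only relevant on Elasticsearch versions that still use mapping types
                "_id": i,
                "_source": row,
            }

# helpers.bulk consumes the generator and sends the documents in chunks
helpers.bulk(es, generate_actions("accidents.csv"), chunk_size=1000)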

1 Answer


pyelasticsearch supports bulk indexing:

bulk_index(index, doc_type, docs, id_field='id', parent_field='_parent'[, other kwargs; see the pyelasticsearch docs])

For example,

from pyelasticsearch import ElasticSearch

es = ElasticSearch('http://xx.xx.xx.xx:9200/')  # Elasticsearch node (placeholder host)
es_index = 'cities-index'                       # target index (placeholder name)

cities = []
with open('cities.tsv') as f:                   # tab-separated input file (placeholder path)
    for line in f:
        fields = line.rstrip().split("\t")
        city = { "id" : fields[0], "city" : fields[1] }
        cities.append(city)                     # append the document, not the whole list
        if len(cities) == 1000:                 # send a bulk request every 1000 documents
            es.bulk_index(es_index, "city", cities, id_field="id")
            cities = []
if len(cities) > 0:                             # index whatever is left in the final chunk
    es.bulk_index(es_index, "city", cities, id_field="id")
kielni
  • @krisdigitx I don't see why this approach would not work for 6M records. Adjust the number of documents per chunk for best performance and you are fine. Bulks of 1000 documents each are a good starting point. – Alexey Tigarev Dec 31 '15 at 03:58
  • What's the limit on the number of documents per bulk request? If I push it to 10,000, will bulk be able to handle that? If not, would it adapt and break that 10,000 into chunks? (A chunking sketch follows below.) – Soubriquet Oct 07 '16 at 14:17
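On the chunk-size question: in the answer above, bulk_index is called once per accumulated list, so the chunk size is whatever the caller collects before each call; it is not split further for you. A hypothetical helper along these lines (the function name and the 1000-document default are illustrative, not part of pyelasticsearch) keeps the request size bounded regardless of the total document count:

def bulk_index_in_chunks(es, es_index, doc_type, docs, chunk_size=1000):
    # accumulate documents and flush one bulk request per chunk
    chunk = []
    for doc in docs:
        chunk.append(doc)
        if len(chunk) == chunk_size:
            es.bulk_index(es_index, doc_type, chunk, id_field="id")
            chunk = []
    if chunk:  # index the final partial chunk
        es.bulk_index(es_index, doc_type, chunk, id_field="id")

As the earlier comment suggests, experimenting with the chunk size (e.g. 1,000 vs 10,000) against your own cluster is the usual way to find the best-performing value.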