
I am trying to index a CSV file with 6M records into Elasticsearch using the Python pyes module. The code reads each record line by line and pushes it to Elasticsearch... any idea how I can send these as bulk requests?

import csv
import sys
from pyes import ES

# the CSV file path is passed on the command line
reader = csv.reader(open(sys.argv[1]))

header = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']

conn = ES('xx.xx.xx.xx:9200')

counter = 0
for row in reader:
    if counter == 0:
        pass  # skip the header row
    else:
        # build one document per row, mapping header names to column values
        data = dict(zip(header, (str(value) for value in row)))
        print data
        print counter
        conn.index(data, 'accidents-index', 'accidents-type', counter)
    counter += 1
krisdigitx
  • Similar question over here http://stackoverflow.com/questions/9002982/elasticsearch-bulk-index-in-chunks-using-pyes?rq=1 – Aidan Kane Oct 09 '13 at 14:55
  • Based on my investigation, sending 6M records in bulk is not going to be efficient... – krisdigitx Oct 11 '13 at 10:56
  • It's better to use a message queuing server... – krisdigitx Feb 27 '15 at 23:45
  • http://stackoverflow.com/questions/20288770/how-to-use-bulk-api-to-store-the-keywords-in-es-by-using-python works. It doesn't use "pyes" but the more robust "elasticsearch" library http://elasticsearch-py.readthedocs.io/en/master/index.html (a sketch of that approach follows these comments). – Glen Thompson Dec 15 '16 at 00:09
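For reference, here is a minimal, hypothetical sketch of the approach from the last comment, using the elasticsearch-py library's helpers.bulk. The host, the "accidents.csv" file name, and the index/type names are placeholders carried over from the question, not anything prescribed by the library.

from elasticsearch import Elasticsearch, helpers
import csv

es = Elasticsearch(['http://xx.xx.xx.xx:9200'])

def generate_actions(path):
    # stream one bulk action per CSV row instead of building a 6M-item list
    with open(path) as f:
        reader = csv.DictReader(f)   # field names come from the CSV header row
        for i, row in enumerate(reader, 1):
            yield {
                "_index": "accidents-index",
                "_type": "accidents-type",   # only relevant on Elasticsearch versions that still use mapping types
                "_id": i,
                "_source": row,
            }

# helpers.bulk consumes the generator and sends the documents in chunks
helpers.bulk(es, generate_actions("accidents.csv"), chunk_size=1000)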

1 Answer


pyelasticsearch supports bulk indexing:

bulk_index(index, doc_type, docs, id_field='id', parent_field='_parent'[, other kwargs; see the pyelasticsearch docs])

For example,

from pyelasticsearch import ElasticSearch

es = ElasticSearch('http://xx.xx.xx.xx:9200/')  # Elasticsearch node (placeholder host)
es_index = 'cities-index'                       # target index (placeholder name)

cities = []
with open('cities.tsv') as f:                   # tab-separated input file (placeholder path)
    for line in f:
        fields = line.rstrip().split("\t")
        city = { "id" : fields[0], "city" : fields[1] }
        cities.append(city)                     # append the document, not the whole list
        if len(cities) == 1000:                 # send a bulk request every 1000 documents
            es.bulk_index(es_index, "city", cities, id_field="id")
            cities = []
if len(cities) > 0:                             # index whatever is left in the final chunk
    es.bulk_index(es_index, "city", cities, id_field="id")
kielni
  • @krisdigitx I don't see why this approach would not work for 6M records. Adjust the number of documents per chunk for best performance and you are fine. Bulks of 1000 documents each are a good starting point. – Alexey Tigarev Dec 31 '15 at 03:58
  • What's the limit on the number of documents per bulk request? If I push it to 10,000, will bulk be able to handle that? If not, would it adapt and break that 10,000 into chunks? (A chunking sketch follows below.) – Soubriquet Oct 07 '16 at 14:17
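On the chunk-size question: in the answer above, bulk_index is called once per accumulated list, so the chunk size is whatever the caller collects before each call; it is not split further for you. A hypothetical helper along these lines (the function name and the 1000-document default are illustrative, not part of pyelasticsearch) keeps the request size bounded regardless of the total document count:

def bulk_index_in_chunks(es, es_index, doc_type, docs, chunk_size=1000):
    # accumulate documents and flush one bulk request per chunk
    chunk = []
    for doc in docs:
        chunk.append(doc)
        if len(chunk) == chunk_size:
            es.bulk_index(es_index, doc_type, chunk, id_field="id")
            chunk = []
    if chunk:  # index the final partial chunk
        es.bulk_index(es_index, doc_type, chunk, id_field="id")

As the earlier comment suggests, experimenting with the chunk size (e.g. 1,000 vs 10,000) against your own cluster is the usual way to find the best-performing value.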