
I'm trying to reindex Elasticsearch. I used the scan and bulk APIs, but it's very slow. How can I parallelize the process to make it faster? My Python code is as follows:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
actions = []
count = 0
for hit in helpers.scan(es, scroll='20m', index=INDEX, doc_type=TYPE, size=100):
    value = hit.get('_source')
    idval = hit.get('_id')
    # indexAction builds the bulk action dict for the target index/type
    action = indexAction(INDEX_2, TYPE_2, idval, value)
    actions.append(action)
    count += 1
    if count % 200 == 0:
        helpers.bulk(es, actions, stats_only=True,
                     consistency="one", chunk_size=200)
        actions = []

if actions:  # flush whatever is left over
    helpers.bulk(es, actions, stats_only=True,
                 consistency="one", chunk_size=200)

Should I run the scan in multiple processes, or should I run the bulk requests in multiple processes? I've been wondering how Elasticsearch-Hadoop implements this. My index has 20 shards on a 10-node cluster.

Jack

1 Answer


On the Elasticsearch side things are already parallel: you are getting hits from each shard. But you can always add some clauses to your query and simply run multiple searches concurrently. For example, a date range or a numeric/alphabetical range might work well for this.

BTW, since you are using Python, your mileage may vary when doing things concurrently with threads. I've had good experience forking off processes instead of using threads with Python; there used to be issues with e.g. the global lock on the interpreter (the GIL).
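
For illustration, here is a minimal sketch of that idea, assuming the documents have a numeric field you can partition on (the field name price, the range boundaries, and the index/type names are all hypothetical). Each worker process runs its own scan over one range and bulk-indexes the hits into the target index:

from multiprocessing import Pool

from elasticsearch import Elasticsearch, helpers

SOURCE_INDEX, SOURCE_TYPE = "source_index", "doc"   # hypothetical names
TARGET_INDEX, TARGET_TYPE = "target_index", "doc"

def reindex_range(bounds):
    lower, upper = bounds
    es = Elasticsearch()  # one client per worker process
    query = {"query": {"range": {"price": {"gte": lower, "lt": upper}}}}
    hits = helpers.scan(es, query=query, scroll="20m",
                        index=SOURCE_INDEX, doc_type=SOURCE_TYPE, size=100)
    actions = ({"_index": TARGET_INDEX, "_type": TARGET_TYPE,
                "_id": hit["_id"], "_source": hit["_source"]}
               for hit in hits)
    # helpers.bulk chunks the generator itself, so no manual batching is needed
    return helpers.bulk(es, actions, chunk_size=200, stats_only=True)

if __name__ == "__main__":
    # Non-overlapping ranges; pick boundaries that split the data roughly evenly.
    ranges = [(0, 100), (100, 200), (200, 300), (300, 400)]
    pool = Pool(processes=len(ranges))
    print(pool.map(reindex_range, ranges))
    pool.close()
    pool.join()

Because each partition is scanned and indexed independently, the scan work is spread across the processes as well, not just the bulk indexing.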

Jilles van Gurp
  • Thank you Jilles, I modified my question; could you please help me again? – Jack Aug 25 '16 at 22:15
  • My Python is a little rusty, but you should be able to use Python's multiprocessing to fire off bulk index requests from several processes: https://docs.python.org/2/library/multiprocessing.html. And you can break up the query like I suggest above. – Jilles van Gurp Aug 26 '16 at 11:45
  • I know how to run the bulk requests in multiple processes, but the problem is that the scan takes about half of the time. For example, the whole reindex process took 10 hours; when I removed the bulk part and only executed the scan, it still took 5 hours. So the question is how to speed up the scan. – Jack Aug 26 '16 at 14:01
  • As I suggested, break the query up and do multiple queries. Also check what the bottleneck is; it might be that you are actually bottlenecked on network I/O rather than CPU. If your Python thread is using 100% CPU, more threads might help. Another thing to look at is the JSON parser. As I recall, there are several, and there are some pretty steep performance differences between them (see the sketch after these comments): http://stackoverflow.com/questions/706101/python-json-decoding-performance – Jilles van Gurp Aug 26 '16 at 14:43
  • Breaking the query up is kind of hard, because ES can give a different result set or ordering each time you query, even though the criteria are the same. So it might cause some data to be missed. – Jack Aug 26 '16 at 16:12
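
On the JSON parser point from the comments, here is a small sketch of one way to plug a faster library into the elasticsearch-py client. The UjsonSerializer class is hypothetical and assumes the ujson package is installed:

import ujson

from elasticsearch import Elasticsearch
from elasticsearch.serializer import JSONSerializer

class UjsonSerializer(JSONSerializer):
    # Drop-in serializer that uses ujson for encoding/decoding.
    def loads(self, s):
        return ujson.loads(s)

    def dumps(self, data):
        if isinstance(data, (str, bytes)):
            return data  # already serialized
        # Note: unlike the stdlib encoder, ujson may not support a default()
        # hook, so types such as datetime would need converting beforehand.
        return ujson.dumps(data)

es = Elasticsearch(serializer=UjsonSerializer())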