
I am using elasticsearch-py to connect to my ES database, which contains over 3 million documents. I want to return all the documents so I can extract data and write it to a CSV. I was able to accomplish this easily for 10 documents (the default return size) using the following code.

from elasticsearch import Elasticsearch

es = Elasticsearch("glycerin")
query = {"query": {"match_all": {}}}
response = es.search(index="_all", doc_type="patent", body=query)

for hit in response["hits"]["hits"]:
    print hit

Unfortunately, when I attempted to implement the scan & scroll so I could get all the documents I ran into issues. I tried it two different ways with no success.

Method 1:

scanResp = es.search(index="_all", doc_type="patent", body=query, search_type="scan", scroll="10m")
scrollId = scanResp['_scroll_id']

response = es.scroll(scroll_id=scrollId, scroll="10m")
print response

(Screenshot of the traceback.) After the request to scroll/ it shows the scroll ID and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: '')

Method 2:

query = {"query": {"match_all": {}}}
scanResp = helpers.scan(client=es, query=query, scroll="10m", index="", doc_type="patent", timeout="10m")

for resp in scanResp:
    print "Hiya"

If I print out scanResp before the for loop I get <generator object scan at 0x108723dc0>. Because of this I'm relatively certain that I'm messing up my scroll somehow, but I'm not sure where or how to fix it.

Results: (Screenshot of the traceback.) Again, after the request to scroll/ it shows the scroll ID and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: '')

I tried increasing the max retries for the transport class, but that didn't make a difference. I would very much appreciate any insight into how to fix this.

Note: My ES is located on a remote desktop on the same network.

drowningincode

2 Answers


The Python scan method generates a GET call to the REST API and tries to send your scroll_id in the URL. The most likely cause here is that your scroll_id is too large to be sent over HTTP, so the server returns no response and you see this error.

Because the scroll_id grows with the number of shards you have, it is better to use a POST and send the scroll_id in JSON as part of the request body. This way you get around the limitation of it being too large for an HTTP call.
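Once the client POSTs the scroll ID, iterating the helpers.scan generator is all that is needed to drive the full scroll, and the hits can be fed straight into the asker's CSV step. A minimal sketch: the CSV helper scan_to_csv and the field names "title"/"abstract" are hypothetical names invented here; the host "glycerin", index, and doc_type come from the question.

```python
import csv

def scan_to_csv(hits, csv_path, fields):
    """Write the given _source fields of each hit to a CSV file.

    `hits` can be the generator returned by helpers.scan -- iterating it
    is what actually drives the scroll requests under the hood.
    """
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(fields)
        for hit in hits:
            src = hit.get("_source", {})
            # Missing fields become empty cells rather than raising KeyError.
            writer.writerow([src.get(field, "") for field in fields])

# Wiring against the cluster from the question (untested sketch):
#
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("glycerin")
#   query = {"query": {"match_all": {}}}
#   scan_to_csv(helpers.scan(client=es, query=query, scroll="10m",
#                            index="_all", doc_type="patent"),
#               "patents.csv", ["title", "abstract"])
```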

chrstahl89
    This is in fact where the error was coming from. Turns out they fixed this a while back, so a simple pip install --update elasticsearch was the official answer to the problem. [make Elasticsearch.scroll POST the scroll ID](https://github.com/elasticsearch/elasticsearch-py/pull/28) – drowningincode Apr 08 '14 at 19:07
  • Make sure you are using an up to date version of Elasticsearch. My problem was I was using a version from before they fixed this – drowningincode Sep 22 '14 at 18:30
  • presumably you mean `pip install --upgrade elasticsearch` – travelingbones Sep 23 '16 at 00:31

Did your issue get resolved?

I have a simple solution: you must update the scroll_id every time after you call the scroll method, like below:

response_tmp = es.scroll(scroll_id=scrollId, scroll="1m")
scrollId = response_tmp['_scroll_id']
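That re-assignment can be wrapped in a small generator that drives the scroll to completion, refreshing the ID from every response. A sketch, assuming scroll_fn wraps the client's scroll call (e.g. lambda sid: es.scroll(scroll_id=sid, scroll="1m")); scroll_all is a name invented here, not part of elasticsearch-py.

```python
def scroll_all(scroll_fn, scroll_id):
    """Yield every hit from a scroll, refreshing the scroll ID from
    each response, since the server may return a new ID per batch."""
    while True:
        response = scroll_fn(scroll_id)
        hits = response["hits"]["hits"]
        if not hits:          # an empty batch means the scroll is done
            break
        for hit in hits:
            yield hit
        scroll_id = response["_scroll_id"]  # update before the next call
```

With the scan request from the question this would be driven as scroll_all(lambda sid: es.scroll(scroll_id=sid, scroll="10m"), scanResp['_scroll_id']).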
dildeepak
zhaochl