I am trying to fetch and process all entries in Elasticsearch using the elasticsearch client in Python. There are approximately 60M records, and the issue I have is that when I increase the size above 1M, the search starts returning nothing.

from elasticsearch import Elasticsearch

es = Elasticsearch("1.1.1.1:1234")

res = es.search(body={
  "from": 0,
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "_exists_:my_string",
            "fields": []
          }
        }
      ],
      "filter": [
        {
          "bool": {
            "must": [
              {
                "range": {
                  "timestamp": {
                    "from": "2019-11-01 01:45:00.000",
                    "to": "2019-11-05 07:45:00.300",
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
})


print("%d documents found" % res['hits']['total'])

I want to convert the results (basically JSON) to a pandas DataFrame. The conversion works well, but I am struggling with how to either fetch all records at once or fetch them in iterations.

Tomas Greif
  • https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-from-size The max size is 10000. You have to use pagination and loop over the pages in order to retrieve all your records. – LeBigCat Nov 05 '19 at 09:26
  • have a look here https://stackoverflow.com/questions/49320599/elastic-search-not-giving-data-with-big-number-for-page-size/49321145#49321145 – Lupanoide Nov 05 '19 at 10:25

1 Answer

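Deep pagination is very costly in distributed systems like Elasticsearch, because every shard has to collect and sort the first from + size hits for each request. For that reason the sum of from and size is capped by the index.max_result_window setting, which is 10,000 by default. To fetch all records for processing, you can use the Scroll API.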

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-request-scroll.html

The first request takes a point-in-time snapshot of the index and returns a scroll ID (a cursor), which you keep passing in your subsequent requests to fetch the next batch.
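A minimal sketch of that loop, assuming the 7.x-style Python client calls (search with a scroll keep-alive, then scroll and clear_scroll). The function name scroll_batches, the batch size, and the "2m" keep-alive are illustrative choices rather than anything from the question, so adjust them to your cluster:

```python
def scroll_batches(es, index, query, batch_size=1000, keep_alive="2m"):
    """Yield lists of `_source` dicts, one list per scroll page.

    `es` is assumed to be an elasticsearch.Elasticsearch client (7.x-style API).
    """
    # The first request opens the scroll context and returns the first page.
    page = es.search(
        index=index,
        body={"query": query},
        scroll=keep_alive,
        size=batch_size,
    )
    scroll_id = page.get("_scroll_id")
    try:
        # An empty page means every matching document has been returned.
        while page["hits"]["hits"]:
            yield [hit["_source"] for hit in page["hits"]["hits"]]
            # Each follow-up request passes the most recent scroll ID back.
            page = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side search context once we are done.
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)
```

You would call it with your own client, a placeholder index name such as "my-index", and the bool query from the question, then build one DataFrame per batch with pd.DataFrame(batch) instead of holding all 60M records in memory at once. The client library also ships elasticsearch.helpers.scan, which wraps the same scroll loop.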

Archit Saxena