I have a query:

s = Search(using=client, index='myindex', doc_type='mytype')
s.query = Q('bool', must=[Q('match', BusinessUnit=bunit),
                          Q('range', **dicdate)])

res = s.execute()

This returns 627033 hits. I want to convert the result into a DataFrame with 627033 rows.

Náthali
  • Can you give more information about the output of ElasticSearch query? If it is simply dictionary, the question should be converting dictionary to dataframe. There are many answers on this for example https://stackoverflow.com/questions/34589332/python-dictionary-to-pandas-dataframe – Nelson Dinh Sep 28 '17 at 15:11
  • Actually, it is not the dictionary format that I am asking about. The search always returns only 10 elements, and I want all of them. – Náthali Sep 28 '17 at 16:55

3 Answers

If your request is likely to return more than 10,000 documents from Elasticsearch, you will need to use the scrolling function of Elasticsearch. Documentation and examples for this function are rather difficult to find, so I will provide you with a full, working example:

import pandas as pd
from elasticsearch import Elasticsearch
import elasticsearch.helpers


es = Elasticsearch('127.0.0.1',
        http_auth=('my_username', 'my_password'),
        port=9200)

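# match_all selects every document in the index; swap in your own query here
body = {"query": {"match_all": {}}}
# scan() wraps the scroll API and yields hits lazily, so it is not capped at 10,000
results = elasticsearch.helpers.scan(es, query=body, index="my_index")
# Each hit stores the document under '_source'; one document becomes one row
df = pd.DataFrame([document['_source'] for document in results])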

Simply edit the values that start with "my_" to match your own setup.
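Since the question builds its query with an elasticsearch_dsl Search object rather than the low-level client, the same idea can be sketched there too. This is a minimal sketch, assuming `Search.scan()` (which wraps the scroll API) and the `to_dict()` and `meta.id` attributes of each hit; the helper name `search_to_rows` is made up for illustration.

```python
def search_to_rows(search):
    """Collect every hit of an elasticsearch_dsl Search as a plain dict.

    Assumption: `search` exposes scan(), which iterates over all matches
    instead of the default page of 10 hits that execute() returns.
    """
    rows = []
    for hit in search.scan():
        row = hit.to_dict()        # the document's _source fields
        row["_id"] = hit.meta.id   # keep the document id as an extra column
        rows.append(row)
    return rows

# Usage (assumes pandas is installed and `s` is the Search from the question):
# import pandas as pd
# df = pd.DataFrame(search_to_rows(s))
```

Building the full list of rows first keeps the code simple, but for several hundred thousand documents it holds everything in memory at once, so selecting only the needed fields in the query may be worthwhile.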

Phil B

Based on your comment I think what you're looking for is size:

es.search(index="my-index", doc_type="mydocs", body="your search", size=1000)

This will not work for 627,033 lines, because a single search is limited by index.max_result_window (10,000 by default). For that many results you will need scroll:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
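The scroll approach from the linked page can be sketched as a small generator. This is only a sketch, assuming a 7.x-style elasticsearch-py client whose `search`, `scroll`, and `clear_scroll` methods accept the keyword arguments used here; the function name `scroll_all` and the defaults for `page_size` and `keep_alive` are illustrative choices, not values from the thread.

```python
def scroll_all(es, index, body, page_size=1000, keep_alive="2m"):
    """Yield every hit for a query by paging through the scroll API."""
    # The first request opens the scroll context and returns the first page
    page = es.search(index=index, body=body, scroll=keep_alive, size=page_size)
    scroll_id = page["_scroll_id"]
    try:
        # An empty page means every hit has been delivered
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            page = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page["_scroll_id"]
    finally:
        # Release the server-side scroll context once iteration ends
        es.clear_scroll(scroll_id=scroll_id)

# Usage (assumes `es` is a connected client and pandas is installed):
# import pandas as pd
# df = pd.DataFrame(hit["_source"] for hit in scroll_all(es, "my-index", {"query": {"match_all": {}}}))
```

In practice `elasticsearch.helpers.scan` performs the same loop, so writing it by hand is mainly useful for seeing what happens on each request.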

a mark

I found the solution by Phil B a good template for my situation. However, all results are returned as lists, rather than atomic data types. To get around this, I added the following helper function and code:

def flat_data(val):
    # Values under 'fields' arrive wrapped in a list; take the first element
    if isinstance(val, list):
        return val[0]
    else:
        return val

df = pd.DataFrame.from_dict([{k: flat_data(v) for (k, v) in document['fields'].items()}
                             for document in results])
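One caveat with taking the first element is that a field holding several values silently loses all but one. A possible variant, shown here as a sketch on made-up sample hits (the field names and the helper name `unwrap_singletons` are hypothetical), unwraps only one-element lists:

```python
def unwrap_singletons(fields):
    """Unwrap one-element lists; leave longer lists and scalars untouched."""
    return {
        key: value[0] if isinstance(value, list) and len(value) == 1 else value
        for key, value in fields.items()
    }

# Hypothetical hits shaped like a response to a search that requested "fields"
sample_hits = [
    {"_id": "1", "fields": {"BusinessUnit": ["retail"], "tags": ["a", "b"]}},
    {"_id": "2", "fields": {"BusinessUnit": ["wholesale"], "tags": ["c"]}},
]

rows = [unwrap_singletons(hit["fields"]) for hit in sample_hits]
# rows can then be passed to pd.DataFrame(rows)
```

The trade-off is that a column such as `tags` may then mix lists and scalars, so this only suits cases where losing values would be worse than an inconsistent column type.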
Ardent Coder
DSJ529