
I have been trying to get word frequencies in Elasticsearch. I am using the Elasticsearch Python client and the Elasticsearch DSL Python client.

Here is my code:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch(["my_ip_machine:port"])
s = Search(using=client, index=settings.ES_INDEX) \
    .filter("term", content=keyword) \
    .filter("term", provider=json_input["media"]) \
    .filter("range", **{"publish": {"from": begin, "to": end}})
s.aggs.bucket("group_by_state", "terms", field="content")  # aggs are attached to s in place
result = s.execute()

I run that code and get output like this (I trimmed the output to be more concise):

{
  "word1": 8,
  "word2": 8,
  "word3": 6,
  "word4": 4
}

The code runs without problems on my laptop, against an Elasticsearch instance with only 2,000 documents. But I hit a problem running the same code on my DigitalOcean Droplet: there I have >2,000,000 documents in Elasticsearch, and the Droplet has 1 GB of RAM. Every time I run that code, memory usage climbs until Elasticsearch shuts down.

Is there another, more efficient way to get word frequencies in Elasticsearch with a large number of documents? An answer as a raw Elasticsearch query is fine; I will convert it to DSL.

Thank you :)

kandito

1 Answer


When I ran into this problem, I had to go here for the answer:

Elasticsearch query to return all records

You need to grab the documents in chunks, say 2,000 at a time, and then loop over them, making multiple queries; see the sketch below.
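
As a minimal sketch of that approach (assuming the index name and content field from your question, and that content is whitespace-separated text), you can stream the documents with the scan helper from elasticsearch-py, which wraps the scroll API, and count words client-side:

from collections import Counter

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

client = Elasticsearch(["my_ip_machine:port"])

counts = Counter()
# scan() pulls hits in fixed-size chunks via the scroll API, so memory use
# stays bounded instead of building one huge in-memory terms aggregation.
# You can put the filtered query from your question here instead of match_all.
for hit in scan(client, index=settings.ES_INDEX,
                query={"query": {"match_all": {}}}, size=2000):
    counts.update(hit["_source"]["content"].split())

print(counts.most_common(10))  # the ten most frequent words

This trades Elasticsearch's fielddata memory for client-side counting time: it scales with the number of documents scanned rather than with the node's heap.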

cybergoof
  • @robertklep Yes, but that won't work with aggregations. Do you have an answer? Thanks. – kandito Apr 27 '15 at 07:37
  • @kandito aggregations are memory-hungry. You can read [the docs](http://www.elastic.co/guide/en/elasticsearch/guide/master/fielddata.html) (or [here](http://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata-formats.html), but that's pretty low level) to see if you can limit the amount of RAM required, but ultimately, 1G for 2M docs may just be too little. – robertklep Apr 27 '15 at 07:48
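
As a hedged illustration of the kind of limit those docs describe, you can cap the fielddata cache in elasticsearch.yml (the 10% figure is only illustrative, not a tuned value):

# Cap the fielddata cache so aggregations evict old entries instead of
# growing until the node runs out of heap; 10% is an illustrative value.
indices.fielddata.cache.size: 10%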