
I am using Python and Elasticsearch to process large amounts of data. With the Search API, the response contains the requested documents in a "hits" list:

{
  ...
  "hits" : {
    ...
    "hits" : [
      { "_source": {...} },
      { "_source": {...} },
      { "_source": {...} }
    ]
  },
  ...
}

However, each document is nested inside a _source field rather than returned as the raw document I wanted (and expected) from Elasticsearch. To make this data usable, I have to pull the document out of every hit's _source field (hits.hits[]._source) into a new list, like this:

def extract_items(es_response):
    # Pull each raw document out of its _source wrapper.
    hits = es_response.get("hits", {}).get("hits", [])
    items = []
    for hit in hits:
        items.append(hit.get("_source"))
    return {
        "items": items
    }

Ideally, I would not have to copy each document out of the response into a new list. Is there a way to configure Elasticsearch to return the document data NOT nested in _source? If not, is my solution the best way to work around this? I was thinking of using Python generators, but I need to check whether they fit my use case better (I believe they can be slower but use less memory).
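For reference, here is a minimal sketch of the generator variant I have in mind, next to an eager list comprehension. The function names and the sample response are made up for illustration; only the hits.hits[]._source layout comes from the response shown above.

```python
from typing import Any, Dict, Iterator, List


def iter_sources(es_response: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Lazily yield each document's _source, one hit at a time."""
    # Default to empty containers so a response without hits yields nothing.
    for hit in es_response.get("hits", {}).get("hits", []):
        yield hit.get("_source", {})


def list_sources(es_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Eagerly build the full list of documents in one pass."""
    return [hit["_source"] for hit in es_response["hits"]["hits"]]


# Hypothetical response, shaped like the filter_path output in the question.
sample_response = {
    "hits": {"hits": [{"_source": {"id": 1}}, {"_source": {"id": 2}}]}
}

for doc in iter_sources(sample_response):
    print(doc)
```

Either way the parsed response is already fully in memory, so the generator only avoids building a second list of references; it does not reduce the size of the response itself.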

Note: I am aware of Elasticsearch's filter_path parameter, which lets you return ONLY the _source field (the response example above assumes it is in use), but each document is still nested inside its own _source field and has to be lifted up a level. For that reason, this question is not a duplicate of previously asked questions on this topic.
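To make the filter_path usage concrete, this is roughly how I call it. It is a sketch that assumes the official elasticsearch-py client; the helper name, index name, and query are placeholders, and the client is passed in so the function does not depend on a live cluster.

```python
from typing import Any, Dict, List


def search_sources(es: Any, index: str, query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Run a search limited to hits.hits._source and unwrap the documents."""
    # `es` is assumed to be an elasticsearch.Elasticsearch instance.
    # body= is what the 7.x client accepts; newer clients may prefer
    # passing the query pieces as top-level keyword arguments.
    response = es.search(
        index=index,
        body=query,
        filter_path=["hits.hits._source"],
    )
    # As far as I can tell, a filtered response with no matches omits the
    # "hits" key entirely, hence the defaults.
    return [hit["_source"] for hit in response.get("hits", {}).get("hits", [])]
```

Even with filter_path applied, the unwrapping step in the last line is still needed, which is the part I would like to avoid.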

mswank
  • this answer might help: https://stackoverflow.com/questions/43772834/need-to-return-source-fields-only-without-any-metadata-how-to-use-plugin/43774728#43774728 or you can use [elasticdump](https://github.com/taskrabbit/elasticsearch-dump) with the `sourceOnly` option – Val Jul 08 '19 at 17:46
  • You could also use [`jq`](https://stedolan.github.io/jq/) if you are working with scripts in a terminal. – Nikolay Vasiliev Jul 10 '19 at 15:23

0 Answers