How to get specific _source fields in aggregation

Question

I am exploring Elasticsearch for an application that will handle large volumes of data and generate statistics over them. I need to retrieve certain statistics for a particular field: its unique values, the document frequency of each value, and the length of each value. The value lengths are indexed along with each document. So far, I have experimented with a terms aggregation, using the following query:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "type_count": {
      "terms": {
        "field": "val.keyword",
        "size": 100
      }
    }
  }
}

The query returns all the values in the field val, each with the number of documents in which it occurs. I would like the field val_len to be returned as well. Is this possible with Elasticsearch? In other words, is it possible to include specific _source fields in buckets? I have looked through the online documentation, but I haven't found a solution yet. I hope somebody can point me in the right direction. Thanks in advance!

I tried to include _source in the following ways:

"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100
    },
    "_source": ["val_len"]
  }
}

and

"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100,
      "_source": ["val_len"]
    }
  }
}

But I guess this isn't the right way, because both gave me parsing errors.

Yes you can, "_source": [ "fielda", "fieldb" ], you can also use script on them. — LeBigCat, Feb 12 '19 at 11:56
@LeBigCat I'm getting a parse error on added "_source" to aggregation. — Poonam Anthony, Feb 12 '19 at 12:02

Val · Accepted Answer · 2019-02-12T13:19:09.000

16

You need to use another sub-aggregation called top_hits, like this:

"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100
    },
    "aggs": {
      "hits": {
        "top_hits": {
          "_source": ["val_len"],
          "size": 1
        }
      }
    }
  }
}
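To read val_len back out of such a response, the nesting matters: each terms bucket holds the top_hits result under the sub-aggregation's name ("hits"), which in turn wraps the usual hits.hits array. Here is a minimal Python sketch of that, using only the standard library. The helper names and the sample response are made up for illustration (the values "alpha" and "be" are not from the question), and sending the request to a cluster is left out.

```python
import json


def build_top_hits_query(field="val.keyword", source_field="val_len", size=100):
    """Build the terms + top_hits request body shown in the answer."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "type_count": {
                "terms": {"field": field, "size": size},
                "aggs": {
                    "hits": {
                        "top_hits": {"_source": [source_field], "size": 1}
                    }
                },
            }
        },
    }


def extract_rows(response, source_field="val_len"):
    """Return (value, doc_count, length) for every terms bucket."""
    rows = []
    for bucket in response["aggregations"]["type_count"]["buckets"]:
        # bucket["hits"] is the top_hits sub-aggregation; its documents
        # sit two levels further down, under hits.hits.
        top = bucket["hits"]["hits"]["hits"]
        length = top[0]["_source"].get(source_field) if top else None
        rows.append((bucket["key"], bucket["doc_count"], length))
    return rows


# Hypothetical, trimmed response; only the keys read above are included.
sample_response = {
    "aggregations": {
        "type_count": {
            "buckets": [
                {
                    "key": "alpha",
                    "doc_count": 3,
                    "hits": {"hits": {"hits": [{"_source": {"val_len": 5}}]}},
                },
                {
                    "key": "be",
                    "doc_count": 1,
                    "hits": {"hits": {"hits": [{"_source": {"val_len": 2}}]}},
                },
            ]
        }
    }
}

request_body = json.dumps(build_top_hits_query())  # JSON text to send as the search body
rows = extract_rows(sample_response)
```

With "size": 1 in top_hits, only one document per bucket is returned, which is enough here as long as every document sharing a value also shares the same val_len.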

Another option is to add an avg sub-aggregation instead, which also lets you sort the buckets on it:

"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100,
      "order": {
        "length": "desc"
      }
    },
    "aggs": {
      "length": {
        "avg": {
          "field": "val_len"
        }
      }
    }
  }
}
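The order path has to end in a metrics aggregation, which is why avg works here. Assuming every document with the same val also carries the same val_len, other single-value metrics such as min or max should return the same number per bucket. The Python sketch below builds the same request with the metric as a parameter; the function names and the allowed-metric list are illustrative choices, not part of any Elasticsearch client.

```python
# Single-value metric aggregations that can be named directly in "order".
SINGLE_VALUE_METRICS = {"avg", "min", "max", "sum"}


def build_sorted_query(metric="avg", direction="desc",
                       field="val.keyword", length_field="val_len", size=100):
    """Build a terms aggregation ordered by a single-value metric sub-aggregation."""
    if metric not in SINGLE_VALUE_METRICS:
        raise ValueError(f"unsupported metric for ordering in this sketch: {metric}")
    if direction not in ("asc", "desc"):
        raise ValueError(f"direction must be 'asc' or 'desc', got: {direction}")
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "type_count": {
                "terms": {
                    "field": field,
                    "size": size,
                    # "length" must match the sub-aggregation name below.
                    "order": {"length": direction},
                },
                "aggs": {"length": {metric: {"field": length_field}}},
            }
        },
    }


def extract_lengths(response):
    """Return (value, doc_count, metric value) per bucket, in response order."""
    buckets = response["aggregations"]["type_count"]["buckets"]
    return [(b["key"], b["doc_count"], b["length"]["value"]) for b in buckets]
```

For example, build_sorted_query("max", "asc") yields the same structure as the query above with "max" in place of "avg" and ascending order.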

edited Feb 12 '19 at 13:19

answered Feb 12 '19 at 12:33


Thanks for your answer. I did try this out and it gives me the `val_len` field nested within each bucket. However, I need to be able to sort using this field. Is it possible to do so? – Poonam Anthony Feb 12 '19 at 12:41
In that case, you need something else. See my updated answer. – Val Feb 12 '19 at 12:43
I tried out the second query, but I'm getting the following error: `Invalid aggregation order path [length]. Buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end.` – Poonam Anthony Feb 12 '19 at 13:16
Oops, my bad, you need to use `avg` and not `terms` indeed, sorry about that – Val Feb 12 '19 at 13:19
Thanks a lot. This serves my purpose. If I am not wrong any numeric aggregation can be used in the nested `aggs` block named as `length`, right? – Poonam Anthony Feb 13 '19 at 05:51
