
I am exploring Elasticsearch for use in an application that will handle large volumes of data and generate statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document. So far, I have experimented with the Terms Aggregation, using the following query:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "type_count": {
      "terms": {
        "field": "val.keyword",
        "size": 100
      }
    }
  }
}

The query returns the values of the field val (up to 100 of them, given the size above) together with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this using Elasticsearch? In other words, is it possible to include specific _source fields in buckets? I have looked through the documentation available online, but I haven't found a solution yet. Hoping somebody can point me in the right direction. Thanks in advance!

I tried to include _source in the following two ways:

"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100
    },
    "_source": ["val_len"]
  }
}

and

"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100,
      "_source": ["val_len"]
    }
  }
}

But I guess this isn't the right way, because both gave me parsing errors.

maksadbek
Poonam Anthony

1 Answer


You need to add a sub-aggregation called top_hits, like this:

"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100
    },
    "aggs": {
      "hits": {
        "top_hits": {
          "_source": ["val_len"],
          "size": 1
        }
      }
    }
  }
}
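With this query, the val_len of each value ends up nested under the sub-aggregation's name inside every bucket. As a minimal sketch of reading it back in Python, assuming the response has already been parsed into a dict: the `sample_response` below is hand-written for illustration (its values are made up) and only mirrors the shape returned for the query above, where the sub-aggregation is named `hits`, hence the repeated `hits` keys. The helper name `extract_value_stats` is hypothetical.

```python
# Hypothetical, hand-written response fragment; the values are illustrative only.
sample_response = {
    "aggregations": {
        "type_count": {
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 3,
                    # <agg name "hits"> -> top_hits result "hits" -> list of hits
                    "hits": {"hits": {"hits": [{"_source": {"val_len": 3}}]}},
                },
                {
                    "key": "hello",
                    "doc_count": 1,
                    "hits": {"hits": {"hits": [{"_source": {"val_len": 5}}]}},
                },
            ]
        }
    }
}


def extract_value_stats(response):
    """Return a (value, doc_count, val_len) tuple for each terms bucket."""
    rows = []
    for bucket in response["aggregations"]["type_count"]["buckets"]:
        top = bucket["hits"]["hits"]["hits"]
        # top_hits was requested with size 1, so at most one document is present.
        val_len = top[0]["_source"].get("val_len") if top else None
        rows.append((bucket["key"], bucket["doc_count"], val_len))
    return rows


rows = extract_value_stats(sample_response)
for value, doc_count, val_len in rows:
    print(value, doc_count, val_len)
```

Because top_hits returns whole documents rather than a single number, the terms buckets cannot be ordered by it.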

Another way is to add an avg sub-aggregation on val_len, which also lets you sort the buckets by it:

"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100,
      "order": {
        "length": "desc"
      }
    },
    "aggs": {
      "length": {
        "avg": {
          "field": "val_len"
        }
      }
    }
  }
}
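The order path has to end in a metrics aggregation (or a single-bucket one), which is why a bucket aggregation such as terms under `length` is rejected. As a rough sketch, the request body could be built in Python as below. The function name `build_sorted_terms_query` and the `SINGLE_VALUE_METRICS` set are illustrative assumptions, not part of any Elasticsearch client; the set lists single-value metric aggregations that should be usable in place of avg.

```python
import json

# Single-value metric aggregations; a terms bucket can be ordered by any of
# these by name. (Assumed non-exhaustive list, for illustration.)
SINGLE_VALUE_METRICS = {"avg", "min", "max", "sum", "value_count", "cardinality"}


def build_sorted_terms_query(metric="avg", direction="desc", size=100):
    """Build a terms aggregation on val.keyword ordered by a metric on val_len."""
    if metric not in SINGLE_VALUE_METRICS:
        # A bucket aggregation such as "terms" here triggers the
        # "Invalid aggregation order path" error on the server side.
        raise ValueError(f"{metric!r} is not a single-value metric aggregation")
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "type_count": {
                "terms": {
                    "field": "val.keyword",
                    "size": size,
                    # Order buckets by the sub-aggregation named "length".
                    "order": {"length": direction},
                },
                "aggs": {"length": {metric: {"field": "val_len"}}},
            }
        },
    }


body = build_sorted_terms_query("max")
print(json.dumps(body, indent=2))
```

Since every document with the same val carries the same val_len, avg, min, and max should all give the same number per bucket here; in the response it appears as `length.value` inside each bucket.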
Val
  • Thanks for your answer. I did try this out and it gives me the `val_len` field nested within each bucket. However, I need to be able to sort using this field. Is it possible to do so? – Poonam Anthony Feb 12 '19 at 12:41
  • In that case, you need something else. See my updated answer. – Val Feb 12 '19 at 12:43
  • I tried out the second query, but I'm getting the following error: `Invalid aggregation order path [length]. Buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end.` – Poonam Anthony Feb 12 '19 at 13:16
  • Oops, my bad, you need to use `avg` and not `terms`, sorry about that. – Val Feb 12 '19 at 13:19
  • Thanks a lot. This serves my purpose. If I am not wrong, any numeric aggregation can be used in the nested `aggs` block named `length`, right? – Poonam Anthony Feb 13 '19 at 05:51