I am performing a terms aggregation on documents stored in an index. My documents are products and I am aggregating on each product's brand name.
# GET /products/_search/
{
  "query": {
    "match": { "name": "iphone 5" }
  },
  "aggs": {
    "brands_name": {
      "terms": {
        "field": "brand",
        "size": 10
      }
    }
  }
}
Results are, as expected, a bucket of brand names and their doc_counts.
{
  "aggregations": {
    "brands_name": {
      "doc_count_error_upper_bound": 577,
      "sum_other_doc_count": 239924,
      "buckets": [
        {
          "key": "Irrelevant Brand 1",
          "doc_count": 8539
        },
        {
          "key": "Irrelevant Brand 2",
          "doc_count": 7616
        },
        ...
      ]
    }
  }
}
The number of hits can be quite high for generic searches. In my case, only the first results with high scores are relevant. Since the aggregation runs over all the hits (even those with low scores), common brands tend to always be present in the buckets list (their doc_count is high) even though they may not correspond to the relevant results.
I want to push what I consider to be the relevant brands to the top of the buckets.
My idea is to scope the aggregation to only the first n documents (it could be n per result set or per shard, it does not matter). I have not yet succeeded in doing this.
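The per-shard scoping I am after looks a lot like what the sampler aggregation provides; I believe it was added in more recent Elasticsearch versions, so it may not be available to me. A sketch of what I have in mind (the shard_size value of 200 is an arbitrary placeholder):

```json
# GET /products/_search/
{
  "query": {
    "match": { "name": "iphone 5" }
  },
  "aggs": {
    "top_docs": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "brands_name": {
          "terms": { "field": "brand", "size": 10 }
        }
      }
    }
  }
}
```

If I understand it correctly, shard_size restricts the nested terms aggregation to the top-scoring documents on each shard, which is exactly the kind of scoping I describe above.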
I tried different approaches that are not working for me:
- using a filtered query with a limit filter. This does not work, as it may exclude documents with a high score
- using min_score. While this allows running the aggregation on a scope containing only high scores, it is really not flexible
- the top_hits aggregation. It does not allow sub-aggregations, which makes it impossible to run a terms aggregation on the top hits
- aggregating results by score with a histogram aggregation: this could work by splitting results into small score intervals and then reducing the results until reaching approximately n documents. But it feels a bit dirty, and Elasticsearch does not seem to support decimal intervals yet