Elasticsearch exclude top hit on field value

Question

{'country': 'France', 'collected': '2018-03-12', 'active': true}
{'country': 'France', 'collected': '2018-03-13', 'active': true}
{'country': 'France', 'collected': '2018-03-14', 'active': false}
{'country': 'Canada', 'collected': '2018-02-01', 'active': false}
{'country': 'Canada', 'collected': '2018-02-02', 'active': true}

Let's say I have this resultset, and I want to group them by country. After grouping them by country this will be the result:

{'country': 'France', 'collected': '2018-03-14', 'active': false}
{'country': 'Canada', 'collected': '2018-02-02', 'active': true}

But I want to exclude results where the last row active is false (the older rows of the same country can be true or false doesn't matter as long as the last row equals true), how can I do that in elasticsearch? Here is my query:

POST /test/_search?search_type=count
{
    "aggs": {
        "group": {
            "terms": {
                "field": "country"
            },
            "aggs": {
                "group_docs": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "collected": {
                                    "order": "desc"
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}

Not sure why you say that pipelining didn't work. I would have thought of using something like this (using a `filter` aggregation): — Andrei Stefan, Jul 26 '18 at 13:02
`"aggs": { "filter_inactive": { "filter": { "term": { "active": true } }, "aggs": { "group": { "terms": { "field": "country" }, "aggs": { "group_docs": { "top_hits": { "size": 1, "sort": [ { "collected": { "order": "desc" } } ] } } } } } } }` — Andrei Stefan, Jul 26 '18 at 13:02
you've been given 3 separate solutions, all of them answer the question to varying degrees. You should accept one or give reasons why they are not acceptable. — bryan60, Aug 02 '18 at 13:50
@bryan60 I've given them feedback and edited my question to be more clear, thanks for the heads up — Ismail, Aug 03 '18 at 07:57
OK, i see the update to the question and understand now. What you were after was fairly unclear before — bryan60, Aug 03 '18 at 12:35

score 3 · Answer 1 · answered Jul 26 '18 at 13:09

I think you can get away with sorting by two fields in your top_hits: by active and by collected. Basically, you want trues to be first and when equal, then sort by collected. Something like the following will always show the active:true documents sorted by collected.

The only downside to this solution is that if you don't have any active documents, top_hits will show one active:false document.

{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "active": {
                  "order": "desc"
                }, 
                "collected": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

I don't want the last active:true row, i only want the countries where the last row the active column must equal true, if it is false exclude it. This only gives the latest active row, if the second row was active:true and the first was active:false it will return the second row but i only want to return it when the last row is active:true — Ismail, Aug 03 '18 at 07:55

score 0 · Answer 2 · answered Jul 16 '18 at 16:31

0

Why don't you filter them beforehand:

POST /test/_search
{
  "query": {
    "bool": {
      "filter": [
        { 
          "term":{ 
            "active": true 
          }
        }
      ]
    }
  }, 
  "aggs": {
    "group": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "group_docs": {
            "top_hits": {
              "size": 2,
              "sort": [
                {
                  "collected": 
                  {"order":"desc"}
                }
              ]
            }
          }
        }
      }
    }
  }

This query will group your data by county where active is true.

More info: filter context

answered Jul 16 '18 at 16:31

TechnocratSid

2,245
1
16
26

That won't work, then you aggregrate on filtered data. (For example, if the latest row is active:false, and you filter that. the latest row in the aggregation becomes always active). Which you don't want, you need the active false data to aggregate on, sort, and remove those where the latest is active:false – Ismail Jul 18 '18 at 10:40
Do you want to filter top_hits ? – TechnocratSid Jul 18 '18 at 12:06
Yes – Ismail Jul 18 '18 at 12:29
I don't think so that is possible, the only thing which you can do is adding a sub aggregation before the top_hits to filter your aggregation results. – TechnocratSid Jul 19 '18 at 02:32
Cool can you post the answer ? – TechnocratSid Jul 19 '18 at 10:44
Actually it did not – Ismail Jul 26 '18 at 12:48

score -1 · Answer 3 · answered Aug 02 '18 at 02:01

It sounds like you are looking to filter the aggregation results. I believe you'll need to use a filter bucket for this

Something like:

{
    "aggs": {
        "group": {
            "terms": {
                "field": "country"
            },
            "filter": {
                "term": {
                    "active": true
                }
            },
            "aggs": {
                "group_docs": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "collected": {
                                    "order": "desc"
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}

No, this does not work, please read the comment on the other answer https://stackoverflow.com/a/51366352/1904334 — Ismail, Aug 03 '18 at 07:52

bryan60 · Answer 4 · 2018-08-02T04:14:14.637

Generally, you can nest aggregations as needed to achieve any outcome. In this case, adding a filter bucket aggregation in between should achieve the desired outcome.

{
  "size": 0,
  "aggs": {
    "group": {
      "terms": { "field": "country" },
      "aggs": {
        "active_in_group": {
          "filter" : { "term": { "active": true } },
          "aggs": {
            "group_docs": {
              "top_hits": {
                "size": 1,
                "sort": [
                  { "collected": { "order": "desc" } }
                ]
              }
            }
          }
        }
      }
    }
  }
}

Here you have:

Agg level 1 - terms bucket; what is the count of each country in your result set (active or inactive)

Agg level 2 - filter bucket; what is the count of active items within each country bucket

Agg level 3 - top hits; what is the top result (most recently collected, according to your sort) of the active items within each country bucket

As you can see, any nested aggregation always respects the aggregations it is nested within.

One thing I'm unclear on, is if you want the count within each country bucket to reflect only the active items, or also the inactive items, or if you don't care about the counts at all and you're just using the term buckets to get the top hits within each country.

If you want the counts to reflect only the active items, then reverse the term and the filter aggregations, if you want the counts to include active and inactive, keep this order. If you don't care about the counts, the order doesn't matter.

This will of course add a level of aggregation to your results (the count of active items within each country), but that should be easy enough to overcome / ignore when parsing the results.

This solution has been verified to work in elastic 6.X, but I can see you must still be on elastic 1.x for some reason since you're using search_type=count which was deprecated in elastic 2.x. This solution should still work since these specific aggregations haven't changed for some time but I can't verify that there isn't some bug or something that has since been patched since elastic 1.x is very out of date. For future ref, elastic changes a lot from version to version. You generally want to include your version in any questions about elastic and check the version on any answers. In any event, I'd recommend an upgrade if you can.

Elasticsearch exclude top hit on field value

4 Answers4

Linked