10
{'country': 'France', 'collected': '2018-03-12', 'active': true}
{'country': 'France', 'collected': '2018-03-13', 'active': true}
{'country': 'France', 'collected': '2018-03-14', 'active': false}
{'country': 'Canada', 'collected': '2018-02-01', 'active': false}
{'country': 'Canada', 'collected': '2018-02-02', 'active': true}

Let's say I have this resultset, and I want to group them by country. After grouping them by country this will be the result:

{'country': 'France', 'collected': '2018-03-14', 'active': false}
{'country': 'Canada', 'collected': '2018-02-02', 'active': true}

But I want to exclude results where the last row active is false (the older rows of the same country can be true or false doesn't matter as long as the last row equals true), how can I do that in elasticsearch? Here is my query:

POST /test/_search?search_type=count
{
    "aggs": {
        "group": {
            "terms": {
                "field": "country"
            },
            "aggs": {
                "group_docs": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "collected": {
                                    "order": "desc"
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}
Ismail
  • 8,904
  • 3
  • 21
  • 39
  • 1
    Not sure why you say that pipelining didn't work. I would have thought of using something like this (using a `filter` aggregation): – Andrei Stefan Jul 26 '18 at 13:02
  • `"aggs": { "filter_inactive": { "filter": { "term": { "active": true } }, "aggs": { "group": { "terms": { "field": "country" }, "aggs": { "group_docs": { "top_hits": { "size": 1, "sort": [ { "collected": { "order": "desc" } } ] } } } } } } }` – Andrei Stefan Jul 26 '18 at 13:02
  • you've been given 3 separate solutions, all of them answer the question to varying degrees. You should accept one or give reasons why they are not acceptable. – bryan60 Aug 02 '18 at 13:50
  • @bryan60 I've given them feedback and edited my question to be more clear, thanks for the heads up – Ismail Aug 03 '18 at 07:57
  • OK, i see the update to the question and understand now. What you were after was fairly unclear before – bryan60 Aug 03 '18 at 12:35

4 Answers4

3

I think you can get away with sorting by two fields in your top_hits: by active and by collected. Basically, you want trues to be first and when equal, then sort by collected. Something like the following will always show the active:true documents sorted by collected.

The only downside to this solution is that if you don't have any active documents, top_hits will show one active:false document.

{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "active": {
                  "order": "desc"
                }, 
                "collected": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
Andrei Stefan
  • 51,654
  • 6
  • 98
  • 89
  • I don't want the last active:true row, i only want the countries where the last row the active column must equal true, if it is false exclude it. This only gives the latest active row, if the second row was active:true and the first was active:false it will return the second row but i only want to return it when the last row is active:true – Ismail Aug 03 '18 at 07:55
0

Why don't you filter them beforehand:

POST /test/_search
{
  "query": {
    "bool": {
      "filter": [
        { 
          "term":{ 
            "active": true 
          }
        }
      ]
    }
  }, 
  "aggs": {
    "group": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "group_docs": {
            "top_hits": {
              "size": 2,
              "sort": [
                {
                  "collected": 
                  {"order":"desc"}
                }
              ]
            }
          }
        }
      }
    }
  }

This query will group your data by county where active is true.

More info: filter context

TechnocratSid
  • 2,245
  • 1
  • 16
  • 26
-1

It sounds like you are looking to filter the aggregation results. I believe you'll need to use a filter bucket for this

Something like:

{
    "aggs": {
        "group": {
            "terms": {
                "field": "country"
            },
            "filter": {
                "term": {
                    "active": true
                }
            },
            "aggs": {
                "group_docs": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "collected": {
                                    "order": "desc"
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}
Brandon Novick
  • 141
  • 2
  • 6
  • No, this does not work, please read the comment on the other answer https://stackoverflow.com/a/51366352/1904334 – Ismail Aug 03 '18 at 07:52
-1

Generally, you can nest aggregations as needed to achieve any outcome. In this case, adding a filter bucket aggregation in between should achieve the desired outcome.

{
  "size": 0,
  "aggs": {
    "group": {
      "terms": { "field": "country" },
      "aggs": {
        "active_in_group": {
          "filter" : { "term": { "active": true } },
          "aggs": {
            "group_docs": {
              "top_hits": {
                "size": 1,
                "sort": [
                  { "collected": { "order": "desc" } }
                ]
              }
            }
          }
        }
      }
    }
  }
}

Here you have:

Agg level 1 - terms bucket; what is the count of each country in your result set (active or inactive)

Agg level 2 - filter bucket; what is the count of active items within each country bucket

Agg level 3 - top hits; what is the top result (most recently collected, according to your sort) of the active items within each country bucket

As you can see, any nested aggregation always respects the aggregations it is nested within.

One thing I'm unclear on, is if you want the count within each country bucket to reflect only the active items, or also the inactive items, or if you don't care about the counts at all and you're just using the term buckets to get the top hits within each country.

If you want the counts to reflect only the active items, then reverse the term and the filter aggregations, if you want the counts to include active and inactive, keep this order. If you don't care about the counts, the order doesn't matter.

This will of course add a level of aggregation to your results (the count of active items within each country), but that should be easy enough to overcome / ignore when parsing the results.

This solution has been verified to work in elastic 6.X, but I can see you must still be on elastic 1.x for some reason since you're using search_type=count which was deprecated in elastic 2.x. This solution should still work since these specific aggregations haven't changed for some time but I can't verify that there isn't some bug or something that has since been patched since elastic 1.x is very out of date. For future ref, elastic changes a lot from version to version. You generally want to include your version in any questions about elastic and check the version on any answers. In any event, I'd recommend an upgrade if you can.

bryan60
  • 28,215
  • 4
  • 48
  • 65