0

I am doing some queries on an index products I have in elasticsearch. I have the following documents in index products

{ "product_name": "prod-1", "meta": [ { "key": "key1", "value": "value1" }, { "key": "key2", "value": "value2" } ] }
{ "product_name": "prod-2", "meta": [ { "key": "key1", "value": "value1" } ] }
{ "product_name": "prod-2", "meta": [ { "key": "key2", "value": "value2" } ] }
{ "product_name": "prod-3", "meta": [ { "key": "key2", "value": "value2" } ] }

What I want now is to get product_names that have both key1/value1 and key2/value2 in meta array but not necessarily in the same document. For example, in the above data prod-1 has both key1/value1 and key2/value2 in the same document, so I want prod-1 in the result. And prod-2 also has both key1/value1 and key2/value2 but in different documents. I want prod-2 also in the result. prod-3 has only key2/value2 even across documents. So, I don't want to get prod-3 in the results.

I am trying the following apporach

  1. Group by product name
  2. Then filter the aggregation results to check each product has both key1/value1 and key2/value2

I group them by product_name and combine meta fields in each bucket as follows

{
  "size": 0,
  "aggs": {
    "by_product": {
      "terms": {
        "field": "product_name"
      },
      "aggs": {
        "all_meta": {
          "top_hits": {
            "_source": {
              "includes": [
                "meta.key",
                "meta.value"
              ]
            }
          }
        }
      }
    }
  }
}

The result after the above aggregation is effectively the following

  "aggregations" : {
    "by_product" : {
      ...
      "buckets" : [
        {
          ...
          "key" : "prod-2",
          "all_meta" : {
            "hits" : {
              ...
              "hits" : [
                {
                  ....
                  "_source" : {
                    "meta" : [
                      {
                        "value" : "value1",
                        "key" : "key1"
                      }
                    ]
                  }
                },
                {
                  ....
                  "_source" : {
                    "meta" : [
                      {
                        "value" : "value2",
                        "key" : "key2"
                      }
                    ]
                  }
                }
              ]
            }
          }
        },
        {
          ....
          "key" : "prod-1",
          "all_meta" : {
            "hits" : {
              ....
              "hits" : [
                {
                  ....
                  "_source" : {
                    "meta" : [
                      {
                        "value" : "value1",
                        "key" : "key1"
                      },
                      {
                        "value" : "value2",
                        "key" : "key2"
                      }
                    ]
                  }
                }
              ]
            }
          }
        },
        {
          ....
          "key" : "prod-3",
          "all_meta" : {
            "hits" : {
              ....
              "hits" : [
                {
                  ....
                  "_source" : {
                    "meta" : [
                      {
                        "value" : "value2",
                        "key" : "key2"
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }

Now, I want to filter values from the above aggrgation and get buckets only when each bucket has both { "key": "key1", "value": "value1" } and { "key": "key2", "value": "value2" } in meta. Something like this

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "buckets.all_meta.hits.hits._source.meta",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "buckets.all_meta.hits.hits._source.meta.key": "key1"
                    }
                  },
                  {
                    "match": {
                      "buckets.all_meta.hits.hits._source.meta.value": "value1"
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "buckets.all_meta.hits.hits._source.meta",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "buckets.all_meta.hits.hits._source.meta.key": "key2"
                    }
                  },
                  {
                    "match": {
                      "buckets.all_meta.hits.hits._source.meta.value": "value2"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

But I'm not sure how to do the above step. Is it possible to do this? This stackoverflow question is similar but it doesn't have any answers. Is there any other approach to get the results I want? Any help would be appreciated. Thanks.

troglodyte07
  • 3,598
  • 10
  • 42
  • 66
  • What you want is to only take into account documents that have both key1/value1 and key2/value2 and aggregate them, right? So you should only have a single bucket for `prod-1`, correct? – Val Nov 12 '20 at 12:27
  • @Val I want to get products that have both key1/value1 and key2/value2 but not necessarily in a single document. Here, prod-2 has both key1/value1 and key2/value2 but in different documents. I want both prod-1 and prod-2 in the results – troglodyte07 Nov 12 '20 at 12:33
  • Gotcha, makes sense now. – Val Nov 12 '20 at 12:33

1 Answers1

1

Here is a solution. The idea is that in each product bucket, we aggregate all key/value pairs (using a scripted terms aggregation) and then using a bucket_selector pipeline aggregation, we only select the the product buckets which have two different pairs.

POST products/_search
{
  "size": 0,
  "aggs": {
    "by_product": {
      "terms": {
        "field": "product_name.keyword"
      },
      "aggs": {
        "meta": {
          "nested": {
            "path": "meta"
          },
          "aggs": {
            "kv": {
              "terms": {
                "script": """
                [doc['meta.key.keyword'].value, doc['meta.value.keyword'].value].join('-')
                """,
                "size": 10
              }
            }
          }
        },
        "selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "meta>kv._bucket_count"
            },
            "script": "params.count == 2"
          }
        }
      }
    }
  }
}

In the results, you can see that we only have prod-1 and prod-2`:

  "buckets" : [
    {
      "key" : "prod-2",
      "doc_count" : 2,
      "meta" : {
        "doc_count" : 2,
        "kv" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "key1-value1",
              "doc_count" : 1
            },
            {
              "key" : "key2-value2",
              "doc_count" : 1
            }
          ]
        }
      }
    },
    {
      "key" : "prod-1",
      "doc_count" : 1,
      "meta" : {
        "doc_count" : 2,
        "kv" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "key1-value1",
              "doc_count" : 1
            },
            {
              "key" : "key2-value2",
              "doc_count" : 1
            }
          ]
        }
      }
    }
  ]
Val
  • 207,596
  • 13
  • 358
  • 360