ElasticSearch aggregating by a nested field with variable nesting (or over particular json field)

Question

I have the following structure, as returned by `GET /index-*/_mapping`:

    "top_field" : {
      "properties" : {
        "dict_key1" : {
          "properties" : {
            "field1" : {...},
            "field2" : {...},
            "field3" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "field4" : {...}
          }
        },
        "dict_key2" : {
          "properties" : {
            "field1" : {...},
            "field2" : {...},
            "field3" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "field4" : {...}
          }
        },
        "dict_key3": ...
      }
    }

In other words, top_field stores a JSON object whose keys (dict_key1, dict_key2, ...) vary from document to document.

I would like to aggregate over 'field3.keyword' regardless of dict_key*. Something like top_field.*.field3.keyword.

However, I can't get it to work using a terms aggregation, with or without nested. I also tried simply bucketing by the different dict_key* values, which would be almost as good, but I can't get that to work either.

How can I do this?

score 1 · Accepted Answer · answered Nov 22 '20 at 12:47

TL;DR I had the same problem some time ago (Terms aggregation with nested wildcard path) and it turns out it's not directly possible due to the way lookups and path accessors are performed.

There's a scripted workaround though:

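GET /index-*/_search
{
  "size": 0,
  "aggs": {
    "terms_emulator": {
      "scripted_metric": {
        "init_script": "state.keyword_counts = [:]",
        "map_script": """
          // Read the raw _source, since the sub-keys of top_field are not known in advance.
          def source = params._source['top_field'];
          if (source != null) {
            for (def key : source.keySet()) {
              if (!source[key].containsKey('field3')) continue;

              def field3_kw = source[key]['field3'];

              if (state.keyword_counts.containsKey(field3_kw)) {
                state.keyword_counts[field3_kw] += 1;
              } else {
                state.keyword_counts[field3_kw] = 1;
              }
            }
          }
        """,
        "combine_script": "return state",
        "reduce_script": """
          // One state arrives per shard, so merge them all rather than returning only states[0].
          def merged = [:];
          for (def s : states) {
            if (s == null) continue;
            for (def entry : s.keyword_counts.entrySet()) {
              merged[entry.getKey()] = merged.getOrDefault(entry.getKey(), 0) + entry.getValue();
            }
          }
          return ['keyword_counts': merged];
        """
      }
    }
  }
}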

yielding something along the lines of

"aggregations" : {
  "terms_emulator" : {
    "value" : {
      "keyword_counts" : {
        "world" : 1,
        "hello" : 2
      }
    }
  }
}
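
To make the three script phases easier to follow, here is a minimal Python sketch that emulates them outside of Elasticsearch. It is not Elasticsearch code: the function names (`map_doc`, `run_shard`, `reduce_states`) and the two sample "shards" are made up for illustration. It assumes each shard produces one state and that the reduce step sees a list of those states, which is why a reduce step that returned only the first state would miss the counts from `shard_b`.

```python
from collections import Counter


def map_doc(state, source):
    """Mirror of map_script: count field3 under every sub-key of top_field."""
    top = source.get("top_field") or {}
    for sub in top.values():
        if isinstance(sub, dict) and "field3" in sub:
            state[sub["field3"]] += 1


def run_shard(docs):
    """init_script + map_script + combine_script for a single shard."""
    state = Counter()
    for doc in docs:
        map_doc(state, doc)
    return state


def reduce_states(states):
    """Mirror of a reduce_script that merges every per-shard state."""
    merged = Counter()
    for state in states:
        merged.update(state)
    return dict(merged)


# Hypothetical sample data, split across two shards.
shard_a = [
    {"top_field": {"dict_key1": {"field3": "hello"}, "dict_key2": {"field3": "world"}}},
]
shard_b = [
    {"top_field": {"dict_key3": {"field3": "hello"}}},
    {"other_field": 1},  # a document without top_field is simply skipped
]

keyword_counts = reduce_states([run_shard(shard_a), run_shard(shard_b)])
print(keyword_counts)  # {'hello': 2, 'world': 1}
```
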

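While this works just fine, I'd advise against using scripts in production: this one has to load and parse `_source` for every matching document, which gets slow on large indices. It's better to restructure your data so that straightforward lookups are possible. For instance: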

{
  "top_field": {
    "entries": [
      {
        "group_name": "dict_key1",
        "key_value_pairs": {
          "field3": "hello"
        }
      },
      {
        "group_name": "dict_key2",
        "key_value_pairs": {
          "field3": "world"
        }
      }
    ]
  }
}

and map `entries` as `nested`. You might even drop `top_field`, since it seems redundant, and start directly with `entries`.
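
As a rough sketch of what that restructuring could look like, the Python below converts the dictionary-of-dictionaries layout into the `entries` array and builds the matching mapping and search bodies as plain dicts. Everything beyond the field names in the example document is an assumption: `to_entries`, the aggregation names `entries`, `by_field3` and `by_group`, and mapping `group_name` as `keyword` are illustrative choices, and the dicts would still need to be sent to the cluster with whatever client you use.

```python
def to_entries(top_field):
    """Turn {"dict_key1": {...}, ...} into {"entries": [{group_name, key_value_pairs}, ...]}."""
    return {
        "entries": [
            {"group_name": name, "key_value_pairs": fields}
            for name, fields in top_field.items()
        ]
    }


# Mapping fragment: `entries` is nested, so each entry is matched on its own.
mapping = {
    "properties": {
        "top_field": {
            "properties": {
                "entries": {
                    "type": "nested",
                    "properties": {
                        "group_name": {"type": "keyword"},  # assumed type
                        "key_value_pairs": {
                            "properties": {
                                "field3": {
                                    "type": "text",
                                    "fields": {
                                        "keyword": {"type": "keyword", "ignore_above": 256}
                                    },
                                }
                            }
                        },
                    },
                }
            }
        }
    }
}

# Search body: a nested aggregation scoped to the entries, with two terms
# sub-aggregations (by field3 value, and by the former dict_key name).
search_body = {
    "size": 0,
    "aggs": {
        "entries": {
            "nested": {"path": "top_field.entries"},
            "aggs": {
                "by_field3": {
                    "terms": {"field": "top_field.entries.key_value_pairs.field3.keyword"}
                },
                "by_group": {"terms": {"field": "top_field.entries.group_name"}},
            },
        }
    },
}

old_doc = {"top_field": {"dict_key1": {"field3": "hello"}, "dict_key2": {"field3": "world"}}}
new_doc = {"top_field": to_entries(old_doc["top_field"])}
```

With this layout, no script is needed: the `nested` aggregation steps into `top_field.entries`, and the `terms` sub-aggregations address a fixed field path.
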

Joe - GMapsBook.com

Thanks, I will take a look at this and let you know. Unfortunately, `top_field` cannot be ditched because it is part of a collection of other fields, ones that are used in aggregations as well. However, is there no way to simply access the "values" of the {dict_key* : values} pairs as a list, so that it would essentially be a nested aggregation? Using the post you linked: something like `"path": "nested_parent.values()"` or some other way to interface with it. – AOK Nov 22 '20 at 13:37
And I can now confirm that this does work. I will wait a bit before accepting since there might be a solution out there that does not involve scripting. – AOK Nov 22 '20 at 14:05
Gotcha. Yea no there's no such interface -- at least not outside of the script context. Plus, whenever the context encounters a `nested` data type, it can only access the 'current' iteration. That's why we had to use `params._source` to gain access to all of the top level doc's attributes, i.e. `"nested_parent.values()"`... – Joe - GMapsBook.com Nov 22 '20 at 22:14
Ultimately, I got the data structure changed and educated everyone working on our system about proper document design in ES. It seems that arrays of dictionaries are preferred to a dictionary of dictionaries. – AOK Nov 23 '20 at 10:08
Nice. They certainly are preferred. Good luck with the rest of your implementation! – Joe - GMapsBook.com Nov 23 '20 at 10:42
Hey @AOK -- I've just launched my [Elasticsearch Handbook](https://elasticsearchbook.com/) and I think you'd find it useful! – Joe - GMapsBook.com Mar 18 '21 at 09:58
