57

The only close thing that I've found was: Multiple group-by in Elasticsearch

Basically I'm trying to get the ES equivalent of the following MySql query:

select gender, age_range, count(distinct profile_id) as count 
FROM TABLE group by age_range, gender

The age and gender by themselves were easy to get:

{
  "query": {
    "match_all": {}
  },
  "facets": {
    "ages": {
      "terms": {
        "field": "age_range",
        "size": 20
      }
    },
    "gender_by_age": {
      "terms": {
        "fields": [
          "age_range",
          "gender"
        ]
      }
    }
  },
  "size": 0
}

which gives:

{
  "ages": {
    "_type": "terms",
    "missing": 0,
    "total": 193961,
    "other": 0,
    "terms": [
      {
        "term": 0,
        "count": 162643
      },
      {
        "term": 3,
        "count": 10683
      },
      {
        "term": 4,
        "count": 8931
      },
      {
        "term": 5,
        "count": 4690
      },
      {
        "term": 6,
        "count": 3647
      },
      {
        "term": 2,
        "count": 3247
      },
      {
        "term": 1,
        "count": 120
      }
    ]
  },
  "total_gender": {
    "_type": "terms",
    "missing": 0,
    "total": 193961,
    "other": 0,
    "terms": [
      {
        "term": 1,
        "count": 94799
      },
      {
        "term": 2,
        "count": 62645
      },
      {
        "term": 0,
        "count": 36517
      }
    ]
  }
}

But now I need something that looks like this:

[breakdown_gender] => Array
    (
        [1] => Array
            (
                [0] => 264
                [1] => 1
                [2] => 6
                [3] => 67
                [4] => 72
                [5] => 40
                [6] => 23
            )

        [2] => Array
            (
                [0] => 153
                [2] => 2
                [3] => 21
                [4] => 35
                [5] => 22
                [6] => 11
            )

    )

Please note that 0,1,2,3,4,5,6 are "mappings" for the age ranges so they actually mean something :) and not just numbers. e.g. Gender[1] (which is "male") breaks down into age range [0] (which is "under 18") with a count of 246.

Frederik Struck-Schøning
  • 12,981
  • 8
  • 59
  • 68
Pavel
  • 964
  • 1
  • 6
  • 18
  • I'm getting like when i call using curl 3{ "error" : { "root_cause" : [ { "type" : "parsing_exception", "reason" : "Unknown key for a START_OBJECT in [facets].", "line" : 6, "col" : 13 } ], "type" : "parsing_exception", "reason" : "Unknown key for a START_OBJECT in [facets].", "line" : 6, "col" : 13 }, "status" : 400 } – Yagnesh bhalala Mar 29 '19 at 16:38

4 Answers4

107

Starting from version 1.0 of ElasticSearch, the new aggregations API allows grouping by multiple fields, using sub-aggregations. Suppose you want to group by fields field1, field2 and field3:

{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }          
        }
      }
    }
  }
}

Of course this can go on for as many fields as you'd like.

Update:
For completeness, here is how the output of the above query looks. Also below is python code for generating the aggregation query and flattening the result into a list of dictionaries.

{
  "aggregations": {
    "agg1": {
      "buckets": [{
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            },
            {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }, ...
          ]
        },
        {
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            },
            {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }, ...
          ]
        }, ...
      ]
    }
  }
}

The following python code performs the group-by given the list of fields. I you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have version 2.0 of Elasticsearch thanks to this)

def group_by(es, fields, include_missing):
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        buckets.append(agg_result[(current_field + '_missing')])

    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result
Joe
  • 2,994
  • 5
  • 31
  • 34
  • I am getting an error like Unrecognized token "my fields value" . How can I fix this ? – Happy Coder Sep 15 '14 at 06:49
  • using sub-aggregations for large data and changing the format of it's response to a two column table with simple coding, can take a rather long time. is there another way to do this? – Makan Feb 16 '15 at 16:50
  • @HappyCoder - can you add more details about the problem you're having? For example - what is the query you're using? – Joe Feb 27 '15 at 12:56
  • @MakanTayebi - may I ask which programming language are you using? – Joe Feb 27 '15 at 12:57
  • I am coding with PHP. I could handle this specific task with a C module, but of course I'd prefer the elasticsearch to do this on its own. – Makan Feb 28 '15 at 06:33
  • @MakanTayebi - I added python code that does what you need (I think). I'm not fluent enough in PHP. Hope it helps. – Joe Jul 23 '15 at 14:03
  • Well, I have been assigned to another project currently. but thanx anyway – Makan Jul 23 '15 at 18:10
24

As you only have 2 fields a simple way is doing two queries with single facets. For Male:

{
    "query" : {
      "term" : { "gender" : "Male" }
    },
    "facets" : {
        "age_range" : {
            "terms" : {
                "field" : "age_range"
            }
        }
    }
}

And for female:

{
    "query" : {
      "term" : { "gender" : "Female" }
    },
    "facets" : {
        "age_range" : {
            "terms" : {
                "field" : "age_range"
            }
        }
    }
}

Or you can do it in a single query with a facet filter (see this link for further information)

{
    "query" : {
       "match_all": {}
    },
    "facets" : {
        "age_range_male" : {
            "terms" : {
                "field" : "age_range"
            },
            "facet_filter":{
                "term": {
                    "gender": "Male"
                }
            }
        },
        "age_range_female" : {
            "terms" : {
                "field" : "age_range"
            },
            "facet_filter":{
                "term": {
                    "gender": "Female"
                }
            }
        }
    }
}

Update:

As facets are about to be removed. This is the solution with aggregations:

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "male": {
      "filter": {
        "term": {
          "gender": "Male"
        }
      },
      "aggs": {
        "age_range": {
          "terms": {
            "field": "age_range"
          }
        }
      }
    },
    "female": {
      "filter": {
        "term": {
          "gender": "Female"
        }
      },
      "aggs": {
        "age_range": {
          "terms": {
            "field": "age_range"
          }
        }
      }
    }
  }
}
Frederik Struck-Schøning
  • 12,981
  • 8
  • 59
  • 68
moliware
  • 10,160
  • 3
  • 37
  • 47
  • 2
    As on Wednesday October 28, 2015, the elasticsearch official website states "Facets are deprecated and will be removed in a future release. You are encouraged to migrate to aggregations instead". – Ankur Oct 28 '15 at 13:12
  • can I have date_histogram as one aggregation? – the_learner Jan 30 '18 at 14:47
0

I have tried grouping profiles on organization yearly revenue and the count will then further distributed among industries using the following query

Example:

{
"size": 0,
"aggs": {
    "categories": {
        "filter": {
            "exists": {
                "field": "organization_industries"
            }
        },
        "aggs": {
            "names": {
                "terms": {
                    "field": "organization_revenue_in_thousands_int.keyword",
                    "size": 200,
                    "order": {
                        "_key": "desc"
                    }
                },
                "aggs": {
                    "industry_stats": {
                        "terms": {
                            "field": "organization_industries.keyword"
                        }
                    }
                }
            }
        }
    }
}

}

Output:

    "aggregations": {
    "categories": {
        "doc_count": 195161605,
        "names": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 19226983,
            "buckets": [
                {
                    "key": "99900",
                    "doc_count": 1742,
                    "industry_stats": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {
                                "key": "internet",
                                "doc_count": 1605
                            },
                            {
                                "key": "investment management",
                                "doc_count": 81
                            },
                            {
                                "key": "biotechnology",
                                "doc_count": 54
                            },
                            {
                                "key": "computer & network security",
                                "doc_count": 2
                            }
                        ]
                    }
                },                
                {
                    "key": "998000",
                    "doc_count": 71,
                    "industry_stats": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {
                                "key": "finance",
                                "doc_count": 48
                            },
                            {
                                "key": "information technology & services",
                                "doc_count": 23
                            }
                        ]
                    }
                }
                
                }
            ]
        }
    }
Frederik Struck-Schøning
  • 12,981
  • 8
  • 59
  • 68
Subhamay
  • 197
  • 1
  • 5
0

I know, it doesn't answer the question, but I found this page while looking for a way to do multi terms aggregation. Finally, found info about this functionality in the documentation. Maybe it will help somebody... multi_terms aggregation:

      "aggs": {
        "lat_lng": {
          "multi_terms": {
            "terms": [{
              "field": "lat"
            },{
              "field": "lng"
            }]
          }
        }
      }

The result will be something close to

    ...
        {
          "key" : [
            "43.00861889999999",
            "-78.8186202"
          ],
          "key_as_string" : "43.00861889999999|-78.8186202",
          "doc_count" : 6
        },
    ...
Frederik Struck-Schøning
  • 12,981
  • 8
  • 59
  • 68
Alexey Zalyotov
  • 326
  • 2
  • 5
  • 16