3

I am having hard time translating results from elasticsearch aggregations to pandas. I am trying to write an abstract function which would take nested dictionary (arbitrary number of levels) and flatten them into a pandas dataframe

Here is how a typical result look like

-- edit : I added the parent key as well

x1 = {u'xColor': {u'buckets': [{u'doc_count': 4,
u'key': u'red',
u'xMake': {u'buckets': [{u'doc_count': 3,
   u'key': u'honda',
   u'xCity': {u'buckets': [{u'doc_count': 2, u'key': u'ROME'},
     {u'doc_count': 1, u'key': u'Paris'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}},
  {u'doc_count': 1,
   u'key': u'bmw',
   u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Paris'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}}],
 u'doc_count_error_upper_bound': 0,
 u'sum_other_doc_count': 0}},
 {u'doc_count': 2,
u'key': u'blue',
u'xMake': {u'buckets': [{u'doc_count': 1,
   u'key': u'ford',
   u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Paris'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}},
  {u'doc_count': 1,
   u'key': u'toyota',
   u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}}],
 u'doc_count_error_upper_bound': 0,
   u'sum_other_doc_count': 0}},
    {u'doc_count': 2,
u'key': u'green',
u'xMake': {u'buckets': [{u'doc_count': 1,
   u'key': u'ford',
   u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}},
    {u'doc_count': 1,
      u'key': u'toyota',
     u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}}],
 u'doc_count_error_upper_bound': 0,
 u'sum_other_doc_count': 0}}],
 u'doc_count_error_upper_bound': 0,
 u'sum_other_doc_count': 0}}

what I would like to have is a dataframe with the doc_count of the lowest level

for the first record

 red-honda-rome-2 

 red-honda-paris-1

 red-bmw-paris-1

I came across json_normalize in pandas here but do not understand how to put the arguments and I and have seen different suggestions for flattening a nested dictionary but can't really understand how they work. Any help to get me started would be appreciated Elasticsearch result to table

UPDATE

I tried to use dpath which is a great library, but I do not see how to abstract this (in form of a function which takes just the buckets names as arguments) as dpath cannot handle the structure in which values are lists (and not other dictionaries)

import dpath 
import pandas as pd 

xListData = []
for q1 in dpath.util.get(x1, 'xColor/buckets'):
      xColor = q1['key']
for q2 in dpath.util.get(q1, 'xMake/buckets'):
    #print '--', q2['key']
    xMake = q2['key']
    for q3 in dpath.util.get(q2, 'xCity/buckets'):
        #xDict = []
        xCity = q3['key']
        doc_count = q3['doc_count']
        xDict = {'color': xColor, 'make': xMake, 'city': xCity, 'doc_count': doc_count}
        #print '------', q3['key'], q3['doc_count']
        xListData.append(xDict)

pd.DataFrame(xListData)

This gives:

city    color   doc_count   make
0   ROME    red     2   honda
1   Paris   red     1   honda
2   Paris   red     1   bmw
3   Paris   blue    1   ford
4   Berlin  blue    1   toyota
5   Berlin  green   1   ford
6   Berlin  green   1   toyota
Community
  • 1
  • 1
user1043144
  • 2,680
  • 5
  • 29
  • 45

2 Answers2

2

Try with a recursive function:

import pandas as pd
def elasticToDataframe(elasticResult,aggStructure,record={},fulllist=[]):
    for agg in aggStructure:
        buckets = elasticResult[agg['key']]['buckets']
        for bucket in buckets:
            record = record.copy()
            record[agg['key']] = bucket['key']
            if 'aggs' in agg: 
                elasticToDataframe(bucket,agg['aggs'],record,fulllist)
            else: 
                for var in agg['variables']:
                    record[var['dfName']] = bucket[var['elasticName']]

                fulllist.append(record)

    df = pd.DataFrame(fulllist)
    return df

Then call the function with your data (x1) and a properly configured 'aggStructure' dict. The nested nature of the data must be reflected in this dict.

aggStructure=[{'key':'xColor','aggs':[{'key':'xMake','aggs':[{'key':'xCity','variables':[{'elasticName':'doc_count','dfName':'count'}]}]}]}]
elasticToDataframe(x1,aggStructure)

enter image description here

Cheers

Sleenee
  • 594
  • 1
  • 8
  • 21
  • many Thanks. fantastic ! defining the nested nature of the data is a good idea as it adds also the flexibility of the function. – user1043144 Apr 01 '15 at 09:33
  • Using mutable default argument is not a good idea. [See here more](http://stackoverflow.com/questions/1132941/least-astonishment-and-the-mutable-default-argument) – David206 Dec 14 '16 at 14:44
2

There is a project could do this out of box: https://github.com/onesuper/pandasticsearch

It could also be done manually with a recursive generator and MultiIndex functionality:

https://github.com/onesuper/pandasticsearch/blob/master/pandasticsearch/query.py#L125

onesuper
  • 196
  • 2
  • 7