I am having hard time translating results from elasticsearch aggregations to pandas. I am trying to write an abstract function which would take nested dictionary (arbitrary number of levels) and flatten them into a pandas dataframe
Here is how a typical result look like
-- edit : I added the parent key as well
x1 = {u'xColor': {u'buckets': [{u'doc_count': 4,
u'key': u'red',
u'xMake': {u'buckets': [{u'doc_count': 3,
u'key': u'honda',
u'xCity': {u'buckets': [{u'doc_count': 2, u'key': u'ROME'},
{u'doc_count': 1, u'key': u'Paris'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 1,
u'key': u'bmw',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Paris'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 2,
u'key': u'blue',
u'xMake': {u'buckets': [{u'doc_count': 1,
u'key': u'ford',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Paris'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 1,
u'key': u'toyota',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 2,
u'key': u'green',
u'xMake': {u'buckets': [{u'doc_count': 1,
u'key': u'ford',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 1,
u'key': u'toyota',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}
what I would like to have is a dataframe with the doc_count of the lowest level
for the first record
red-honda-rome-2
red-honda-paris-1
red-bmw-paris-1
I came across json_normalize in pandas here but do not understand how to put the arguments and I and have seen different suggestions for flattening a nested dictionary but can't really understand how they work. Any help to get me started would be appreciated Elasticsearch result to table
UPDATE
I tried to use dpath which is a great library, but I do not see how to abstract this (in form of a function which takes just the buckets names as arguments) as dpath cannot handle the structure in which values are lists (and not other dictionaries)
import dpath
import pandas as pd
xListData = []
for q1 in dpath.util.get(x1, 'xColor/buckets'):
xColor = q1['key']
for q2 in dpath.util.get(q1, 'xMake/buckets'):
#print '--', q2['key']
xMake = q2['key']
for q3 in dpath.util.get(q2, 'xCity/buckets'):
#xDict = []
xCity = q3['key']
doc_count = q3['doc_count']
xDict = {'color': xColor, 'make': xMake, 'city': xCity, 'doc_count': doc_count}
#print '------', q3['key'], q3['doc_count']
xListData.append(xDict)
pd.DataFrame(xListData)
This gives:
city color doc_count make
0 ROME red 2 honda
1 Paris red 1 honda
2 Paris red 1 bmw
3 Paris blue 1 ford
4 Berlin blue 1 toyota
5 Berlin green 1 ford
6 Berlin green 1 toyota