I'm trying to load a pretty confusing multi-nested JSON
into pandas. I'm already using json_normalize but trying to figure out how I'd join 2 similair nested dict
s as well as unpack their sub dict
s and list
s has been stumping me. I have limited understanding of pandas but I'm assuming I can leverage it's performance advantages if I can just get this down.
I have 2 dict's containing war data, one loaded from a JSON API response and one in a database. I'm trying to compare the 2 for new attacks and defenses.
example war
{
"state": "active",
"team_size": 20,
"teams": {
"id": "12345679",
"name": "Good Guys",
"level": 10,
"attacks": 4,
"destruction_percentage": 22.6,
"members": [
{
"id": "1",
"name": "John",
"level": 12
},
{
"id": "2",
"name": "Tom",
"level": 11,
"attacks": [
{
"attackerTag": "2",
"defenderTag": "4",
"damage": 64,
"order": 7
}
]
}
]
},
"opponent": {
"id": "987654321",
"name": "Bad Guys",
"level": 17,
"attacks": 5,
"damage": 20.95,
"members": [
{
"id": "3",
"name": "Betty",
"level": 17,
"attacks": [
{
"attacker_id": "3",
"defender_id": "1",
"damage": 70,
"order": 1
},
{
"attacker_id": "3",
"defender_id": "7",
"damage": 100,
"order": 11
}
],
"opponentAttacks": 0,
"some_useless_data": "Want to ignore, this doesn't show in every record"
},
{
"id": "4",
"name": "Fred",
"level": 9,
"attacks": [
{
"attacker_id": "4",
"defender_id": "9",
"damage": 70,
"order": 4
}
],
"opponentAttacks": 0
}
]
}
}
Now I'm assuming pandas would be my best option as far as performance goes rather than zipping them together and looping through each member and comparing them.
So my attempt to get a dataframe
that flattens nicely and easy to traverse have been difficult to say the least. Optimally I'd assume the following layout. I'm only trying to get both teams into a single df
of just all the members. we can omit the state
and team_size
keys and focus on just getting each member and their respective attacks
and team_id
's
example df
(expected, result):
id name level attacks member.team_id ...
1 John 12 NaN "123456789"
2 Tom 11 [{...}] "123456789"
3 Betty 17 [{...}, {...}] "987654321"
4 Fred 9 [{...}] "987654321"
That's the basic gist of what I'd like as a df
. So then I can take both dataframes and compare for new attacks.
Note I just pop()
'd state
and team_size
from the dict before my attempts as all I want is all members, with the team embedded into it pretty much
I have tried the following with no luck and I know it's not the correct way because it's working backwards up the dict tree.
old_df = json_normalize(war,
'members',
['id', 'name', 'level', 'attacks'],
record_prefix='member')
#Traceback (most recent call last):
# File "test.py", line 83, in <module>
# new_parse(old_war, new_war)
# File "test.py", line 79, in new_parse
# record_prefix='member')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 262, in json_normalize
# _recursive_extract(data, record_path, {}, level=0)
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 238, in _recursive_extract
# recs = _pull_field(obj, path[0])
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 185, in _pull_field
# result = result[spec]
#KeyError: 'members'
I assumed I could use something like the following but that isn't working either.
df = pd.DataFrame.from_dict(old, orient='index')
df.droplevel('members')
#Traceback (most recent call last):
# File "test.py", line 106, in <module>
# new_parse(old_war, new_war)
# File "test.py", line 87, in new_parse
# df.droplevel('members')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 4376, in __getattr__
# return object.__getattribute__(self, name)
#AttributeError: 'DataFrame' object has no attribute 'droplevel'
I appreciate any guidance! Hopefully I put enough in to help understand my intended result, if not please let me know!
Edit To be fair, I do know how to do this just looping the dict and creating a new list of members with the appropriate date, but this I feel is much more inefficient than using pandas as I'm doing this for millions of wars in a threaded application and every bit of performance I can get out of it is a bonus for me and the app. - Thanks again!