4

I'm trying to load a pretty confusing multi-nested JSON into pandas. I'm already using json_normalize but trying to figure out how I'd join 2 similair nested dicts as well as unpack their sub dicts and lists has been stumping me. I have limited understanding of pandas but I'm assuming I can leverage it's performance advantages if I can just get this down.

I have 2 dict's containing war data, one loaded from a JSON API response and one in a database. I'm trying to compare the 2 for new attacks and defenses.

example war

{
  "state": "active",
  "team_size": 20,
  "teams": {
    "id": "12345679",
    "name": "Good Guys",
    "level": 10,
    "attacks": 4,
    "destruction_percentage": 22.6,
    "members": [
      {
        "id": "1",
        "name": "John",
        "level": 12
      },
      {
        "id": "2",
        "name": "Tom",
        "level": 11,
        "attacks": [
          {
            "attackerTag": "2",
            "defenderTag": "4",
            "damage": 64,
            "order": 7
          }
        ]
      }
    ]
  },
  "opponent": {
    "id": "987654321",
    "name": "Bad Guys",
    "level": 17,
    "attacks": 5,
    "damage": 20.95,
    "members": [
      {
        "id": "3",
        "name": "Betty",
        "level": 17,
        "attacks": [
          {
            "attacker_id": "3",
            "defender_id": "1",
            "damage": 70,
            "order": 1
          },
          {
            "attacker_id": "3",
            "defender_id": "7",
            "damage": 100,
            "order": 11
          }
        ],
        "opponentAttacks": 0,
        "some_useless_data": "Want to ignore, this doesn't show in every record"
      },
      {
        "id": "4",
        "name": "Fred",
        "level": 9,
        "attacks": [
          {
            "attacker_id": "4",
            "defender_id": "9",
            "damage": 70,
            "order": 4
          }
        ],
        "opponentAttacks": 0
      }
    ]
  }
}

Now I'm assuming pandas would be my best option as far as performance goes rather than zipping them together and looping through each member and comparing them.

So my attempt to get a dataframe that flattens nicely and easy to traverse have been difficult to say the least. Optimally I'd assume the following layout. I'm only trying to get both teams into a single df of just all the members. we can omit the state and team_size keys and focus on just getting each member and their respective attacks and team_id's

example df (expected, result):

id   name   level  attacks         member.team_id  ...
1    John   12     NaN             "123456789"
2    Tom    11     [{...}]         "123456789"
3    Betty  17     [{...}, {...}]  "987654321"
4    Fred   9      [{...}]         "987654321"

That's the basic gist of what I'd like as a df. So then I can take both dataframes and compare for new attacks.

Note I just pop()'d state and team_size from the dict before my attempts as all I want is all members, with the team embedded into it pretty much

I have tried the following with no luck and I know it's not the correct way because it's working backwards up the dict tree.

old_df = json_normalize(war,
                        'members',
                        ['id', 'name', 'level', 'attacks'],
                        record_prefix='member')

#Traceback (most recent call last):
#  File "test.py", line 83, in <module>
#    new_parse(old_war, new_war)
#  File "test.py", line 79, in new_parse
#    record_prefix='member')
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 262, in json_normalize
#    _recursive_extract(data, record_path, {}, level=0)
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 238, in _recursive_extract
#    recs = _pull_field(obj, path[0])
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 185, in _pull_field
#    result = result[spec]
#KeyError: 'members'

I assumed I could use something like the following but that isn't working either.

df = pd.DataFrame.from_dict(old, orient='index')
df.droplevel('members')

#Traceback (most recent call last):
#  File "test.py", line 106, in <module>
#    new_parse(old_war, new_war)
#  File "test.py", line 87, in new_parse
#    df.droplevel('members')
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 4376, in __getattr__
#    return object.__getattribute__(self, name)
#AttributeError: 'DataFrame' object has no attribute 'droplevel'

I appreciate any guidance! Hopefully I put enough in to help understand my intended result, if not please let me know!

Edit To be fair, I do know how to do this just looping the dict and creating a new list of members with the appropriate date, but this I feel is much more inefficient than using pandas as I'm doing this for millions of wars in a threaded application and every bit of performance I can get out of it is a bonus for me and the app. - Thanks again!

Jab
  • 26,853
  • 21
  • 75
  • 114
  • Is possible create valid json sample data? – jezrael Jan 31 '19 at 07:46
  • That's an interesting question but as jezrael mentioned you should provide a json you'd like us to work with. Not just a pseudo one.-- My bet would be to try normalizing the members list of each team, and append the 2 dfs. I'm on my Phone but you can find some post here on how to use json_normalize() – IMCoins Jan 31 '19 at 07:53
  • I corrected, I took out some stuff and in doing so I broke the JSON format. Also removed the `...` from it as well. Just know there obvously more than just 2 members in a team. lol sorry – Jab Jan 31 '19 at 07:54
  • Is necessary swap prefix in columns names in my answer? – jezrael Jan 31 '19 at 08:13
  • As you say "I'm doing this for millions of wars in a threaded application", does that mean that all wars have the same dictionary-structure? – U13-Forward Jan 31 '19 at 08:23
  • yes @U9-Forward – Jab Jan 31 '19 at 08:39
  • @jezrael do you mean by swap prefix names that the `id` and `name`... fields are the same in member and team? If so then how do I discern the two in your answer? – Jab Jan 31 '19 at 08:42
  • @Jaba - yes, you need `member.` prefix for non prefix values, and for prefixes need remove it. So added code for correct it by `df1.columns = np.where(df1.columns.str.startswith('member.'), df1.columns.str.split('.', n=1).str[1], 'member.' + df1.columns) ` (or use it after concat with `df.columns = np.where(df.columns.str.startswith('member.'), df.columns.str.split('.', n=1).str[1], 'member.' + df.columns)`). – jezrael Jan 31 '19 at 08:44
  • 1
    I saw that, it's got what I need to start now just need to apply to my actual application. I put dummy json but closely resembles what I need. Thank you! – Jab Jan 31 '19 at 08:50
  • 1
    I sent you a email about 30 mins ago – Jab Jan 31 '19 at 09:09
  • @Jaba Oh! thanks! i just saw it. – U13-Forward Jan 31 '19 at 09:13
  • @Jaba Replied to it. – U13-Forward Jan 31 '19 at 09:47

2 Answers2

1

I believe you can use:

need = ['member.id', 'member.name', 'member.level', 'member.attacks','id']
df1 = json_normalize(war['teams'],
                     'members',
                     ['id', 'name', 'level', 'attacks'], 
                     record_prefix='member.')[need]
#print (df1)

df2 = json_normalize(war['opponent'],
                     'members',
                     ['id', 'name', 'level', 'attacks'], 
                     record_prefix='member.')[need]
#print (df2)


df1.columns = np.where(df1.columns.str.startswith('member.'), 
                       df1.columns.str.split('.', n=1).str[1],
                       'member.' + df1.columns)
df2.columns = np.where(df2.columns.str.startswith('member.'), 
                       df2.columns.str.split('.', n=1).str[1],
                       'member.' + df2.columns)


df = pd.concat([df1, df2], sort=False, ignore_index=True)
print (df)
  id   name  level                                            attacks  \
0  1   John     12                                                NaN   
1  2    Tom     11  [{'attackerTag': '2', 'defenderTag': '4', 'dam...   
2  3  Betty     17  [{'attacker_id': '3', 'defender_id': '1', 'dam...   
3  4   Fred      9  [{'attacker_id': '4', 'defender_id': '9', 'dam...   

   member.id  
0   12345679  
1   12345679  
2  987654321  
3  987654321  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Where's `member.team_id`, also, it isn't OP's desired output. – U13-Forward Jan 31 '19 at 08:08
  • @U9-Forward - there are swapped `member_` with no `prefix` columns names. If necessary, I can change it. But first I want wait if my solution is generally working or not. – jezrael Jan 31 '19 at 08:11
  • 1
    Yes, this works, although like @U9-Forward said where does the `team` come into play? Like how do I add it as at least just a column representing the `id` and how do I remove my unwanted data? – Jab Jan 31 '19 at 08:21
  • 1
    @Jaba - So `...` in expected output `id name level attacks member.team_id ...` is necessary ignore ? – jezrael Jan 31 '19 at 08:22
  • @Jaba - I understand it need all columns – jezrael Jan 31 '19 at 08:22
  • I just need a way to correlate each member to their team – Jab Jan 31 '19 at 08:25
  • I mean I could find the team by the member's id but seems like an extra step to take than having it readily available within the member just need to consider performance tradeoff – Jab Jan 31 '19 at 08:26
  • 1
    Thank you! This is a godsend. I'm gonna try a timeit on this but I'm going to say right now that this is way faster than my original approach. haha thanks so much! – Jab Jan 31 '19 at 08:36
  • @jezrael I see you're using numpy to get the team id's. will this work if the id's are alphanumeric strings? As my example had numbers for id's but the id's contain letters and special characters as well. ex: `"#123ABC"` – Jab Jan 31 '19 at 22:41
  • @Jaba It depends if numeric values like `345` in colimns names. If raise error, try convert to strings before np. where like `df.columns = df. columns. astype(str)`. Now I am going to bed, so if still some problem, tomorrow I try find solution. – jezrael Jan 31 '19 at 23:11
1

Try using this four-liner:

d=war['teams']['members']+war['teams']['opponent']['members']
df = pd.DataFrame(d)
df = df.iloc[:,:4][['id','name','level','attacks']]
df['member.team_id']=[war['teams']['opponent']['id'] if i in war['teams']['opponent']['members'] else war['teams']['id'] for i in d]
print(df)

Output:

  id   name  level                                            attacks  \
0  1   John     12                                                NaN   
1  2    Tom     11  [{'attackerTag': '2', 'defenderTag': '4', 'dam...   
2  3  Betty     17  [{'attacker_id': '3', 'defender_id': '1', 'dam...   
3  4   Fred      9  [{'attacker_id': '4', 'defender_id': '9', 'dam...   

  member.team_id  
0       12345679  
1       12345679  
2      987654321  
3      987654321  
U13-Forward
  • 69,221
  • 14
  • 89
  • 114