0

I spent a few hours searching for hints on how to do this, and tried a bunch of things (see below). I'm not getting anywhere, so I finally decided to post a new question.

I have a nested JSON with a dictionary data structure, like this:

for k,v in d.items():
    print(f'{k} = {v}')

First two keys:

obj1 = {'color': 'red', 'size': 'large', 'description': 'a large red ball'}
obj2 = {'color': 'blue', 'size': 'small', 'description': 'a small blue ball'}

Side question: is this actually a nested json? Each key (obj1, obj2) has a set of keys, so I think so but I'm not sure.

I then have a dataframe like this:

df
key   id_num   source
obj1  143  loc1
obj2  139  loc1

I want to map only 'size' and 'description' from my json dictionary to this dataframe, by key. And I want to do that efficiently and readably. I also want it to be robust to the presence of the key, so that if a key doesn't exist in the JSON dict, it just prints "NA" or something.

Things I've tried that got me closest (I tried to map one column at a time, and both at same time):

df['description'] = df['key'].map(d['description'])

df['description'] = df['key'].map(lambda x: d[x]['description'])

df2 = df.join(pd.DataFrame.from_dict(d, orient='index', columns=['size','description']), on='key')

The first one - it's obvious why this doesn't work. It prints KeyError: 'description', as expected. The second one I think would work, but there is a key in my dataframe that doesn't exist in my JSON dict. It prints KeyError: 'obj42' (an object in my df but not in d). The third one works, but requires creating a new dataframe which I'd like to avoid.

How can I make Solution #2 robust to missing keys? Also, is there a way to assign both columns at the same time without creating a new df? I found a way to assign all values in the dict here, but that's not what I want. I only want a couple.

There's always a possibility that my search keywords were not quite right, so if a post exists that answers my question please do let me know and I can delete this one.

Jess
  • 186
  • 3
  • 13

2 Answers2

2

One way to go, based on your second attempt, would be as follows:

import pandas as pd
import numpy as np

d = {'obj1': {'color': 'red', 'size': 'large', 'description': 'a large red ball'},
     'obj2': {'color': 'blue', 'size': 'small', 'description': 'a small blue ball'}
     }

# just adding `obj3` here to supply a `KeyError`
data = {'key': {0: 'obj1', 1: 'obj2', 2: 'obj3'},
 'id_num': {0: 143, 1: 139, 2: 140},
 'source': {0: 'loc1', 1: 'loc1', 2: 'loc1'}}

df = pd.DataFrame(data)

df[['size','description']] = df['key'].map(lambda x: [d[x]['size'], d[x]['description']] if x in d else [np.nan]*2).tolist()

print(df)

    key  id_num source   size        description
0  obj1     143   loc1  large   a large red ball
1  obj2     139   loc1  small  a small blue ball
2  obj3     140   loc1    NaN                NaN
ouroboros1
  • 9,113
  • 3
  • 7
  • 26
  • 1
    Brilliant. Thank you! This solves both problems in one line; thanks for teaching me something new! – Jess Aug 05 '22 at 17:21
  • I might have spoken too soon. This was working fine in my Jupyter notebook (v6.4.11), but inside a script it gives me "slice of a copy" warnings. Changing df[['size','description']] to df.loc[:,('size','description')] does not fix it. It prints out this other bit in the warning: isetter(loc, value[:, i].tolist()), but only points me to user guide's 'returning-a-view-versus-a-copy'. I'm a little stumped on why tolist() is a problem. Any ideas? – Jess Aug 05 '22 at 20:27
  • I trust the warning doesn’t have to do with the tolist() assignment. Try assigning anything else - e.g. df[['size','description']] = 1 - and you should get the same warning. Problem is likely related to how your df was constructed earlier in your script. The dummy example here shouldn’t throw an error in any python compiler. So, in your script, the df is probably a **view** of another df, not a **copy**. I found these posts useful on the matter: https://stackoverflow.com/a/67025721/18470692 and https://stackoverflow.com/a/53954986/18470692. Let me know if you’re unable to resolve the issue. – ouroboros1 Aug 06 '22 at 06:48
0

You can create a dataframe from the dictionary and then do .merge:

df = df.merge(
    pd.DataFrame(d.values(), index=d.keys())[["size", "description"]],
    left_on="key",
    right_index=True,
    how="left",
)
print(df)

Prints:

    key  id_num source   size        description
0  obj1     143   loc1  large   a large red ball
1  obj2     139   loc1  small  a small blue ball
2  obj3     140   loc1    NaN                NaN

Data used:

d = {
    "obj1": {
        "color": "red",
        "size": "large",
        "description": "a large red ball",
    },
    "obj2": {
        "color": "blue",
        "size": "small",
        "description": "a small blue ball",
    },
}

data = {
    "key": {0: "obj1", 1: "obj2", 2: "obj3"},
    "id_num": {0: 143, 1: 139, 2: 140},
    "source": {0: "loc1", 1: "loc1", 2: "loc1"},
}
df = pd.DataFrame(data)
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91