
My input is a Python dictionary:

d = {
  "username": "foo999",
  "email": "bar999@example.com",
  "address": {
    "city": "Faketown",
    "state": "Fakeshire"
  },
  "is_premium_user": True,
}

Using pd.json_normalize(d, record_prefix=False), I'm still getting prefixed column names like address.city:

  username               email  is_premium_user address.city address.state
0   foo999  bar999@example.com             True     Faketown     Fakeshire

My expected output is:

  username               email  is_premium_user         city         state
0   foo999  bar999@example.com             True     Faketown     Fakeshire

The code below works, but it seems hacky and is just a workaround. It also assumes I know all the nested keys:

df.columns = df.columns.str.replace(r"(?:address|2nd_key|...)\.", "", regex=True)
vatbub
  • https://stackoverflow.com/questions/39899005/how-to-flatten-a-pandas-dataframe-with-some-columns-as-json does this help? – PV8 Jul 19 '23 at 12:18
  • Thanks @PV8 but it does not. –  Jul 19 '23 at 12:20
  • 1
    I quite sure it's not possible, else how would you handle identical subkeys? `pd.json_normalize({"address": {"city": "Faketown"}, "destination": {"city": "Othertown"}})` – mozway Jul 19 '23 at 12:24
  • 1
    Thanks @mozway, that's an important question but fortunately, the json we receive is designed in a way that garantees the fields (including the embeded ones) to be unique. –  Jul 19 '23 at 12:27

2 Answers


I don't see a way to do this without first creating the prefixed columns, probably because there could be duplicate column names. But if you know there aren't any, you can fix it afterwards without knowing all of the nested keys:

df.columns = [col.split(".")[-1] for col in df.columns]
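For example, with the dictionary from the question (assuming pandas is imported as pd):

```python
import pandas as pd

d = {
    "username": "foo999",
    "email": "bar999@example.com",
    "address": {"city": "Faketown", "state": "Fakeshire"},
    "is_premium_user": True,
}

df = pd.json_normalize(d)
# keep only the last dotted segment of each column name
df.columns = [col.split(".")[-1] for col in df.columns]
print(list(df.columns))
# ['username', 'email', 'is_premium_user', 'city', 'state']
```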
lamdoug

I'm quite sure it's not possible: the prefix is there to avoid ambiguity in case subkeys are identical:

pd.json_normalize({"address": {"city": "Faketown"},
                   "destination": {"city": "Othertown"}})

Some other workarounds:

df = (pd.json_normalize(d, record_prefix='X')
        .rename(columns=lambda x: x.split('.')[-1])
     )

Or:

df = pd.json_normalize(d, record_prefix='X')
df.columns = df.columns.str.replace(r'.*\.', '', regex=True)

Unnesting the dictionary manually

If you know that a nested key will never be identical to another one, then you could also use a recursive function to pre-process your input and flatten it:

def unnest(d):
    # recursively flatten nested dictionaries, keeping only the innermost keys
    return {k2: v2 for k, v in d.items() for k2, v2 in
            (unnest(v).items() if isinstance(v, dict) else [[k, v]])}

df = pd.json_normalize(unnest(d))

# or
df = pd.DataFrame(unnest(d), index=[0])

Intermediate:

unnest(d)

{'username': 'foo999',
 'email': 'bar999@example.com',
 'city': 'Faketown',
 'state': 'Fakeshire',
 'is_premium_user': True}


# example with deeper nesting
unnest({'A': 1, 'B': {'C': 2, 'D': 3, 'E': {'F': 4}}})

{'A': 1, 'C': 2, 'D': 3, 'F': 4}
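As discussed in the comments, this assumes the nested keys are globally unique. If two branches share a key, the later value silently overwrites the earlier one in the flattened dictionary:

```python
def unnest(d):
    # same recursive flattening as above
    return {k2: v2 for k, v in d.items() for k2, v2 in
            (unnest(v).items() if isinstance(v, dict) else [[k, v]])}

# duplicate "city" subkeys: the second one wins
unnest({"address": {"city": "Faketown"}, "destination": {"city": "Othertown"}})
# {'city': 'Othertown'}
```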
mozway
  • I'm confused about record_prefix because even `record_prefix='X'` seems to have no effect on the column names (besides removing the prefix) but thank you so much for the two options (I like the first one). –  Jul 19 '23 at 12:33
  • 1
    @Rendezvous `record_prefix` makes sense when you have lists, not dictionaries as items. See for example `pd.json_normalize({'A': [[1, 2], [3, 4]]}, record_path='A', record_prefix='Prefix.')` vs `pd.json_normalize({'A': [[1, 2], [3, 4]]}, record_path='A')` – mozway Jul 19 '23 at 13:11
  • 1
    I added an alternative to pre-process the dictionary – mozway Jul 19 '23 at 13:58
  • Thank you so much! I'll try to break it down to understand the process. –  Jul 19 '23 at 14:52