
My input is a Python dictionary:

d = {
  "username": "foo999",
  "email": "bar999@example.com",
  "address": {
    "city": "Faketown",
    "state": "Fakeshire"
  },
  "is_premium_user": True,
}

Using pd.json_normalize(d, record_prefix=False), I'm still getting prefixed column names like address.city:

  username               email  is_premium_user address.city address.state
0   foo999  bar999@example.com             True     Faketown     Fakeshire

My expected output is:

  username               email  is_premium_user         city         state
0   foo999  bar999@example.com             True     Faketown     Fakeshire

The code below works, but it seems hacky and is just a workaround. It also assumes I know all the nested keys:

df.columns = df.columns.str.replace(r"(?:address|2nd_key|...)\.", "", regex=True)
vatbub
  • https://stackoverflow.com/questions/39899005/how-to-flatten-a-pandas-dataframe-with-some-columns-as-json does this help? – PV8 Jul 19 '23 at 12:18
  • Thanks @PV8 but it does not. –  Jul 19 '23 at 12:20
  • 1
    I quite sure it's not possible, else how would you handle identical subkeys? `pd.json_normalize({"address": {"city": "Faketown"}, "destination": {"city": "Othertown"}})` – mozway Jul 19 '23 at 12:24
  • 1
    Thanks @mozway, that's an important question but fortunately, the json we receive is designed in a way that garantees the fields (including the embeded ones) to be unique. –  Jul 19 '23 at 12:27

2 Answers


I don't see a way to do this without first creating the prefixed columns, probably because there could be duplicate column names. But if you know there aren't any, you can fix it afterwards without knowing all of the nested keys:

df.columns = [col.split(".")[-1] for col in df.columns]
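For example, with the dictionary from the question (assuming pandas is imported as pd):

```python
import pandas as pd

d = {
    "username": "foo999",
    "email": "bar999@example.com",
    "address": {"city": "Faketown", "state": "Fakeshire"},
    "is_premium_user": True,
}

df = pd.json_normalize(d)
# keep only the last dotted segment of each column name
df.columns = [col.split(".")[-1] for col in df.columns]
print(list(df.columns))
# ['username', 'email', 'is_premium_user', 'city', 'state']
```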
lamdoug

I'm quite sure it's not possible: the prefix is there to avoid ambiguity in case subkeys are identical:

pd.json_normalize({"address": {"city": "Faketown"},
                   "destination": {"city": "Othertown"}})

Some other workarounds:

df = (pd.json_normalize(d, record_prefix='X')
        .rename(columns=lambda x: x.split('.')[-1])
     )

Or:

df = pd.json_normalize(d, record_prefix='X')
df.columns = df.columns.str.replace(r'.*\.', '', regex=True)

Unnesting the dictionary manually

If you know that a nested key will never be identical to another one, then you could also use a recursive function to pre-process your input and flatten it:

def unnest(d):
    # recursively flatten nested dictionaries, keeping only the innermost keys
    return {k2: v2 for k, v in d.items() for k2, v2 in
            (unnest(v).items() if isinstance(v, dict) else [[k, v]])}

df = pd.json_normalize(unnest(d))

# or
df = pd.DataFrame(unnest(d), index=[0])

Intermediate:

unnest(d)

{'username': 'foo999',
 'email': 'bar999@example.com',
 'city': 'Faketown',
 'state': 'Fakeshire',
 'is_premium_user': True}


# example with deeper nesting
unnest({'A': 1, 'B': {'C': 2, 'D': 3, 'E': {'F': 4}}})

{'A': 1, 'C': 2, 'D': 3, 'F': 4}
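As discussed in the comments, this assumes the nested keys are globally unique. If two branches share a key, the later value silently overwrites the earlier one in the flattened dictionary:

```python
def unnest(d):
    # same recursive flattening as above
    return {k2: v2 for k, v in d.items() for k2, v2 in
            (unnest(v).items() if isinstance(v, dict) else [[k, v]])}

# duplicate "city" subkeys: the second one wins
unnest({"address": {"city": "Faketown"}, "destination": {"city": "Othertown"}})
# {'city': 'Othertown'}
```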
mozway
  • I'm confused about record_prefix because even `record_prefix='X'` seems to have no effect on the column names (besides removing the prefix) but thank you so much for the two options (I like the first one). –  Jul 19 '23 at 12:33
  • 1
    @Rendezvous `record_prefix` makes sense when you have lists, not dictionaries as items. See for example `pd.json_normalize({'A': [[1, 2], [3, 4]]}, record_path='A', record_prefix='Prefix.')` vs `pd.json_normalize({'A': [[1, 2], [3, 4]]}, record_path='A')` – mozway Jul 19 '23 at 13:11
  • 1
    I added an alternative to pre-process the dictionary – mozway Jul 19 '23 at 13:58
  • Thank you so much! I'll try to break it down to understand the process. –  Jul 19 '23 at 14:52