15

I have a Pandas DataFrame where one column is a Series of dicts, like this:

   colA  colB                                  colC
0     7     7  {'foo': 185, 'bar': 182, 'baz': 148}
1     2     8  {'foo': 117, 'bar': 103, 'baz': 155}
2     5    10  {'foo': 165, 'bar': 184, 'baz': 170}
3     3     2  {'foo': 121, 'bar': 151, 'baz': 187}
4     5     5  {'foo': 137, 'bar': 199, 'baz': 108}

I want the foo, bar and baz key-value pairs from the dicts to be columns in my dataframe, such that I end up with this:

   colA  colB  foo  bar  baz
0     7     7  185  182  148
1     2     8  117  103  155
2     5    10  165  184  170
3     3     2  121  151  187
4     5     5  137  199  108

How do I do that?

AllanLRH
  • 1,124
  • 1
  • 12
  • 23
  • 2
    I'm not sure this question is a duplicate of the flagged question. That seems to be about convert *to* a dict and this is getting data out of a dict back into the original dataframe – Phil Nov 17 '20 at 20:01
  • @Phil .. OP doesn't show effort which is one of the reasons why this question is closed. Read: "homework". Pointer -> Book: VanderPlas: Python Data Science Handbook. EoR. – ZF007 Nov 20 '20 at 10:24
  • Just voted to reopen... We'll see what happens. – AllanLRH Nov 20 '20 at 19:20
  • I cannot vote, but this is definitely not a duplicate. I would highly appreciate for the question to be reopened. – Lukas Feb 22 '22 at 13:34
  • 1
    You can use `pd.json_normalize(df["colC"])`. – Carlos Horn Feb 22 '22 at 16:41
  • Good point, I'll update the answer one of the coming days. – AllanLRH Feb 22 '22 at 22:31
  • No, **all** of these questions are about converting a column that contains dicts, into separate columns based off the key-value pairs in those dicts. There's also a much better a canonical, so I went ahead and redirected everything there. – Karl Knechtel Mar 16 '23 at 02:11

2 Answers2

22

TL;DR

Based on Carlos Horn's comment pd.json_normalize are perfect for this:

df_fixed = df.join(pd.json_normalize(df['colC'])).drop('colC', axis='columns')

Old answer

df = df.drop('colC', axis=1).join(pd.DataFrame(df.colC.values.tolist()))

Elaborate (old) answer

We start by defining the DataFrame to work with, as well as importing Pandas:

import pandas as pd


df = pd.DataFrame(
    {
        'colA': {0: 7, 1: 2, 2: 5, 3: 3, 4: 5},
        'colB': {0: 7, 1: 8, 2: 10, 3: 2, 4: 5},
        'colC': {
            0: {'foo': 185, 'bar': 182, 'baz': 148},
            1: {'foo': 117, 'bar': 103, 'baz': 155},
            2: {'foo': 165, 'bar': 184, 'baz': 170},
            3: {'foo': 121, 'bar': 151, 'baz': 187},
            4: {'foo': 137, 'bar': 199, 'baz': 108},
        },
    }
)

The column colC is a pd.Series of dicts, and we can turn it into a pd.DataFrame by turning each dict into a pd.Series:

pd.DataFrame(df.colC.values.tolist())
# df.colC.apply(pd.Series). # this also works, but it is slow

which gives the pd.DataFrame:

   foo  bar  baz
0  154  190  171
1  152  130  164
2  165  125  109
3  153  128  174
4  135  157  188

So all we need to do is:

  1. Turn colC into a pd.DataFrame
  2. Delete the original colC from df
  3. Join the convert colC with df

That can be done in a one-liner:

df = df.drop('colC', axis=1).join(pd.DataFrame(df.colC.values.tolist()))

With the contents of df now being the pd.DataFrame:

   colA  colB  foo  bar  baz
0     2     4  154  190  171
1     4    10  152  130  164
2     4    10  165  125  109
3     3     8  153  128  174
4    10     9  135  157  188
AllanLRH
  • 1,124
  • 1
  • 12
  • 23
  • No, it is slow solution, not recomend use `.apply(pd.Series)`, similar solution with list - check timings from [this](https://stackoverflow.com/a/35491399/2901002) – jezrael Jan 24 '19 at 10:16
  • Sorry, wrong dupe, now added correct. – jezrael Jan 24 '19 at 10:24
  • I changed to answer to use `pd.DataFrame(df.colC.values.tolist())`, although I still mention the `apply`-method as a slower alternative. – AllanLRH Jan 24 '19 at 10:26
  • yes, it is better solution, but unfortunately dupe :( – jezrael Jan 24 '19 at 10:29
  • 1
    Well, I write it because I have never found a clear stackoverflow tutorial for expanding and _replacing_ a columnd-of-dicts, so hopefulle someone will stumble upon this, even of they are redirected to the "original" answer :) And thanks for the heads-up about the perfornamce issue. – AllanLRH Jan 24 '19 at 10:37
  • You are welcome, +1 – jezrael Jan 24 '19 at 10:59
  • @AllanLRH your method fails when some entries in the dictionary-column are empty. Additionally, it does not preserve the row index. I think this is relevant because I believe that in many use cases, when dealing with JSON-objects, you might actually be confronted with lists of dicts - since that makes it necessary to explode the list, it is necessary to preserve row indices. Would you consider changing the answer? Or should I write a new answer? – Lukas Feb 17 '22 at 13:50
  • @Lukas thanks for taking time to help improve this answer :) I've tried removing some of the entries in on of the dicts, and it seems to work fine, just placing NaN's in those positions, which is what I'd expect to happen… alas I'm not sure under which circumstances it's failing? I'd like to update my answer once I understand the problem. As to not preserving the index, changing the indices in the top-level dict in 'colC' does indeed mess things up. If you are up for it, consider submitting an edit to my answer, or just post a new one... though I'm a fan of having a single answer on SO ;) – AllanLRH Feb 20 '22 at 11:35
  • 1
    The problem arises when the entries of the dictionary you want to unpack are lists of dicts, which you then want to use `.explode()` on. Think something like this: `df = pd.DataFrame({'colA': {0: 7}, 'colC': {0: {'foo': [{"position": 71, "value": "a"}, {"position": 10, "value": "b"}], 'bar': [{"position": 91, "value": "d"}, {"position": 11, "value": "c"}]}}})` Because you have to explode the resulting columns, you need to preserve index, but `.tolist()` drops the index. – Lukas Feb 22 '22 at 13:36
  • 1
    Note that the `json_normalize` will reset the index. So unless you reset your index in the original dataset the `join` ruin your data. Be very careful when using the new solution – Coolkau Aug 03 '23 at 07:55
1

I faced the same challenge recently and I managed to do it manually using apply and join.

import pandas as pd

def expand_dict_column(df: pd.DataFrame, column) -> pd.DataFrame:
    df.drop(columns=[column], inplace=False).join(
        df.apply(lambda x: pd.Series(x[column].values(), index=x[column].keys()), axis=1))

In the case of the columns of the question it would look like this:

df.drop(columns=["colC"], inplace=False).join(
    df.apply(lambda x: pd.Series(x["colC"].values(), index=x["colC"].keys()), axis=1))
momvart
  • 1,737
  • 1
  • 20
  • 32