Python: Expand JSON structure in a column into columns in the same dataframe

Question

I have a column that contains JSON-structured data. My df looks like this:

ClientToken                      Data
7a9ee887-8a09-ff9592e08245       [{"summaryId":"4814223456","duration":952,"startTime":1587442919}]
bac49563-2cf0-cb08e69daa48       [{"summaryId":"4814239586","duration":132,"startTime":1587443876}]

I want to expand it to:

ClientToken                      summaryId         duration           startTime
7a9ee887-8a09-ff9592e08245       4814223456             952           1587442919
bac49563-2cf0-cb08e69daa48       4814239586             132           1587443876

Any ideas?

"My df looks like this": is df an instance of a pandas DataFrame? If not, what is it an instance of? – punter147, Apr 29 '20 at 09:37

Alexandre B. · Accepted Answer · 2020-04-29T16:51:46.680

2

You can try:

df[["ClientToken"]].join(df.Data.apply(lambda x: pd.Series(json.loads(x[1:-1]))))

Explanations:

Select the Data column and apply the following steps:
1. The "Data" content is a string holding a list with a single element, so we can strip the surrounding [] manually with x[1:-1] (drop the first and last character).
2. The result is still a string, but we want a dict, so we parse it with the json.loads() function from the json module. The code becomes json.loads(x[1:-1]).
3. Then convert the dict to a pd.Series using pd.Series(json.loads(x[1:-1])).
Add these new columns to the existing dataframe using join. Note the double [] used to select the "ClientToken" column as a dataframe rather than a Series.

Code + illustration:

import pandas as pd
import json

# step 1.1
print(df.Data.apply(lambda x: x[1:-1]))
# 0    {"summaryId":"4814223456","duration":952,"star...
# 1    {"summaryId":"4814239586","duration":132,"star...
# Name: Data, dtype: object

# step 1.2
print(df.Data.apply(lambda x: json.loads(x[1:-1])))
# 0    {'summaryId': '4814223456', 'duration': 952, '...
# 1    {'summaryId': '4814239586', 'duration': 132, '...
# Name: Data, dtype: object

# step 1.3
print(df.Data.apply(lambda x: pd.Series(json.loads(x[1:-1]))))
#     summaryId  duration   startTime
# 0  4814223456       952  1587442919
# 1  4814239586       132  1587443876

# step 2
print(df[["ClientToken"]].join(df.Data.apply(lambda x: pd.Series(json.loads(x[1:-1])))))
#                   ClientToken   summaryId  duration   startTime
# 0  7a9ee887-8a09-ff9592e08245  4814223456       952  1587442919
# 1  bac49563-2cf0-cb08e69daa48  4814239586       132  1587443876
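
The snippets above assume df already exists. The following is a minimal self-contained sketch that rebuilds the sample from the question and produces the same result. It swaps the x[1:-1] slicing for json.loads(x)[0], which assumes every Data cell is a JSON list holding exactly one dict; the names parsed, expanded and result are only illustrative.

```python
import json

import pandas as pd

# Sample data copied from the question; Data holds JSON text, not parsed objects.
df = pd.DataFrame({
    "ClientToken": ["7a9ee887-8a09-ff9592e08245", "bac49563-2cf0-cb08e69daa48"],
    "Data": [
        '[{"summaryId":"4814223456","duration":952,"startTime":1587442919}]',
        '[{"summaryId":"4814239586","duration":132,"startTime":1587443876}]',
    ],
})

# Parse each cell and take the single dict out of the list
# (assumption: exactly one dict per list).
parsed = df["Data"].apply(lambda x: json.loads(x)[0])

# Build one column per key, keeping the original row index for the join.
expanded = pd.DataFrame(parsed.tolist(), index=df.index)

result = df[["ClientToken"]].join(expanded)
print(result)
```

Parsing the whole cell first avoids depending on the exact position of the brackets, for example when a cell has leading or trailing whitespace.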

Edit 1:

Since some rows seem to have several dicts in the Data list, you can try the following (explode requires pandas 0.25 or later):

df[["ClientToken"]].join(df.Data.apply(lambda x: [pd.Series(y)
                                                  for y in json.loads(x)]) \
                    .explode() \
                    .apply(pd.Series))
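
To see the effect on a row with more than one dict, here is a small sketch on made-up data (the tokens and values are hypothetical, not from the question). It is a variant of the approach above, not the exact same code: it parses with json.loads, explodes to one dict per row, and builds the columns with pd.DataFrame. It assumes pandas 0.25 or later and that no list is empty.

```python
import json

import pandas as pd

# Hypothetical sample: the second row's list holds two dicts.
df = pd.DataFrame({
    "ClientToken": ["token-a", "token-b"],
    "Data": [
        '[{"summaryId":"1","duration":10,"startTime":100}]',
        '[{"summaryId":"2","duration":20,"startTime":200},'
        ' {"summaryId":"3","duration":30,"startTime":300}]',
    ],
})

# One row per dict; explode keeps the original index, so the join below
# repeats ClientToken for every dict that came from the same row.
records = df["Data"].apply(json.loads).explode()
expanded = pd.DataFrame(records.tolist(), index=records.index)

result = df[["ClientToken"]].join(expanded)
print(result)
```

The result has three rows, and the index value 1 appears twice because both dicts came from the same original row. Call reset_index(drop=True) on it if a unique index is needed.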

edited Apr 29 '20 at 16:51

answered Apr 29 '20 at 10:26

Alexandre B.


Tried this: df2 = df[["ClientToken"]].join(df.Data.apply(lambda x: pd.Series(x[0]))) After running this df2 contained 2 cols: 1) ClientToken 2) Column name 0 with "[" – gtomer Apr 29 '20 at 13:25
Is `"Data"` a string representing JSON, or already-parsed JSON? Evidently it's a string. I will update the answer – Alexandre B. Apr 29 '20 at 13:27
Thank you Alexandre! Data is text. When looking at the df.info it is an "object" – gtomer Apr 29 '20 at 13:35
I get the following error: JSONDecodeError: Extra data: line 1 column 368 (char 367) – gtomer Apr 29 '20 at 14:16
What does the line look like? – Alexandre B. Apr 29 '20 at 14:30
Sorry, what line? – gtomer Apr 29 '20 at 15:37
The problematic line. I suspect there are some `Data` cells where the `list` of `dict` has more than 1 `dict`? – Alexandre B. Apr 29 '20 at 15:38
df[["ClientToken"]].join(df.Data.apply(lambda x: pd.Series(json.loads(x[1:-1])))) – gtomer Apr 29 '20 at 15:41
I mean in your dataset – Alexandre B. Apr 29 '20 at 15:43
Emmmm. How can I know on which line? – gtomer Apr 29 '20 at 15:45
My original df has Data content up to 3,000 chars long – gtomer Apr 29 '20 at 15:48
Edit1 resulted in: AttributeError: 'Series' object has no attribute 'explode' – gtomer Apr 29 '20 at 15:53
As reported [here](https://github.com/dask/dask/pull/5381/files/2a94bc626903243ac54faa807c87a306fc01d796), you should consider updating `pandas` to version `>=0.25`. For more details on how to upgrade, see [Upgrade version of Pandas](https://stackoverflow.com/questions/37954195/upgrade-version-of-pandas) – Alexandre B. Apr 29 '20 at 15:57
Requirement already up-to-date: pandas in c:\users\user\appdata\local\programs\python\python38\lib\site-packages (1.0.3) – gtomer Apr 29 '20 at 16:06
You want me to send you the df in any way? – gtomer Apr 29 '20 at 16:07
What is the problem now ? – Alexandre B. Apr 29 '20 at 16:14
Edit1 resulted in: AttributeError: 'Series' object has no attribute 'explode' Pandas is already the most updated – gtomer Apr 29 '20 at 16:16
I think you're missing something about the version. Try, in the file you run, to add `print(pd.__version__)` and share the result, thanks – Alexandre B. Apr 29 '20 at 16:22
Well, you are right. The version is 0.24.2. However when I am running 'pip3 install --upgrade pandas' I get: Requirement already up-to-date – gtomer Apr 29 '20 at 16:27
If you have several Python versions installed, be sure to run the appropriate one. In any case, you can try uninstalling the current version and installing the latest one with `pip3 install pandas==1.0.3` – Alexandre B. Apr 29 '20 at 16:33

With Pandas > 0.25 this indeed works well!! Many thanks!! – gtomer Apr 29 '20 at 16:44
Glad to help you ! As you seem new on the website, maybe you can have a look at [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers). Cheers – Alexandre B. Apr 29 '20 at 16:49
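
On the question raised in the comments about how to find the row that triggers the JSONDecodeError: a minimal sketch, assuming the same x[1:-1] slicing as in the answer. The sample data and the helper name parse_error are hypothetical.

```python
import json

import pandas as pd

# Hypothetical sample: the second cell holds two dicts, so stripping the
# brackets leaves '{...},{...}', which is not a single JSON document.
df = pd.DataFrame({
    "ClientToken": ["token-a", "token-b"],
    "Data": [
        '[{"summaryId":"1","duration":10}]',
        '[{"summaryId":"2","duration":20},{"summaryId":"3","duration":30}]',
    ],
})


def parse_error(text):
    """Return the decoder's message if text[1:-1] is not valid JSON, else None."""
    try:
        json.loads(text[1:-1])
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# One entry per row: None where parsing succeeded, the error message otherwise.
errors = df["Data"].apply(parse_error)
bad_rows = df[errors.notna()]
print(bad_rows.index.tolist())
```

Inspecting bad_rows["Data"] then shows the cells that need different handling, such as lists with several dicts.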

score 0 · Answer 2 · answered Apr 29 '20 at 14:57

An alternative, using defaultdict and ast.literal_eval:

from collections import defaultdict
import ast

import pandas as pd

d = defaultdict(list)
# Iterate through the Data column and collect the values for each key
for ent in df.Data:
    for entry in ast.literal_eval(ent):
        for k, v in entry.items():
            d[k].append(v)

# Concatenate with the ClientToken column
pd.concat([df.ClientToken, pd.DataFrame(d)], axis=1)

                  ClientToken   summaryId  duration   startTime
0  7a9ee887-8a09-ff9592e08245  4814223456       952  1587442919
1  bac49563-2cf0-cb08e69daa48  4814239586       132  1587443876
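
One caveat worth checking before relying on ast.literal_eval here: it parses Python literals, not JSON, so it works on the sample above but rejects JSON-only tokens such as null, true and false. A short sketch on a hypothetical cell (not taken from the question):

```python
import ast
import json

# Hypothetical cell containing a JSON null, which is not a Python literal.
cell = '[{"summaryId":"4814223456","duration":null}]'

# json.loads accepts it and maps null to None.
parsed = json.loads(cell)
print(parsed)

# ast.literal_eval raises ValueError because `null` is not a Python literal.
try:
    ast.literal_eval(cell)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

print(literal_eval_ok)
```

If the Data column may contain such values, replacing ast.literal_eval(ent) with json.loads(ent) in the loop above is a drop-in change.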
