7

I have a Pandas dataframe in which one column contains JSON data (the JSON structure is simple: only one level, there is no nested data):

ID,Date,attributes
9001,2020-07-01T00:00:06Z,"{"State":"FL","Source":"Android","Request":"0.001"}"
9002,2020-07-01T00:00:33Z,"{"State":"NY","Source":"Android","Request":"0.001"}"
9003,2020-07-01T00:07:19Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
9004,2020-07-01T00:11:30Z,"{"State":"NY","Source":"windows","Request":"0.001"}"
9005,2020-07-01T00:15:23Z,"{"State":"FL","Source":"ios","Request":"0.001"}"

My Pandas dataframe

I would like to normalize the JSON content in the attributes column so the JSON attributes become each a column in the dataframe.

ID,Date,attributes.State, attributes.Source, attributes.Request
9001,2020-07-01T00:00:06Z,FL,Android,0.001
9002,2020-07-01T00:00:33Z,NY,Android,0.001
9003,2020-07-01T00:07:19Z,FL,ios,0.001
9004,2020-07-01T00:11:30Z,NY,windows,0.001
9005,2020-07-01T00:15:23Z,FL,ios,0.001 

I have been trying using Pandas json_normalize which requires a dictionary. So, I figure I would convert the attributes column to a dictionary but it does not quite work out as expected for the dictionary has the form:

df.attributes.to_dict()

{0: '{"State":"FL","Source":"Android","Request":"0.001"}',
 1: '{"State":"NY","Source":"Android","Request":"0.001"}',
 2: '{"State":"FL","Source":"ios","Request":"0.001"}',
 3: '{"State":"NY","Source":"windows","Request":"0.001"}',
 4: '{"State":"FL","Source":"ios","Request":"0.001"}'}

And the normalization takes the key (0, 1, 2, ...) as the column name instead of the JSON keys.

I have the feeling that I am close but I can't quite work out how to do this exactly. Any idea is welcome.

Thank you!

Wilmar
  • 558
  • 1
  • 5
  • 16

4 Answers4

4

Normalize expects to work on an object, not a string.

import json
import pandas as pd
df_final = pd.json_normalize(df.attributes.apply(json.loads))
Garet Jax
  • 1,091
  • 3
  • 17
  • 37
  • This did the trick for me. I had a _.tsv_ file that I read using _read_csv_ method. Then when I was passing the _df_ to _json_normalize_, but it was just outputting the indexes. My column was just a string. I converted to an object and voila! Thanks – yaach Aug 05 '21 at 00:16
1

You shouldn’t need to convert to a dictionary first.

Try:

import pandas as pd

pd.json_normalize(df[‘attributes’])
  • 1
    Hi @Owen. I tried already but I get an error: AttributeError: 'str' object has no attribute 'values It is the same if I try pd.json_normalize(df[‘attributes’].values) – Wilmar Aug 26 '20 at 20:59
  • How is the data imported? – cutting_shapes Aug 26 '20 at 21:06
  • Hi @Owen, It is just a simple `import pandas as pd` `df = pd.read_csv('datafile.csv')` – Wilmar Aug 26 '20 at 21:13
  • Not sure if this might help, seems like a similar scenario. Apologies if not: https://stackoverflow.com/questions/49671693/pandas-dataframe-normalize-one-json-column-and-merge-with-other-columns – cutting_shapes Aug 27 '20 at 08:15
  • Thank you @Owen. That is an interesting thread. It does not fit what I need. The data I have is not a group of dictionaries and getting it to be leaves me in a very simmilar position to the original question. But I do appreciate the effort. Thank you! – Wilmar Aug 27 '20 at 12:45
0

I found an solution but I am not overly happy with it. I reckon it is very inefficient.

import pandas as pd
import json

# Import full dataframe
df = pd.read_csv(r'D:/tmp/sample_simple.csv', parse_dates=['Date'])

# Create empty dataframe to hold the results of data conversion
df_attributes = pd.DataFrame()

# Loop through the data to fill the dataframe
for index in df.index:
    row_json = json.loads(df.attributes[index])
    normalized_row = pd.json_normalize(row_json)
    # df_attributes = df_attributes.append(normalized_row) (deprecated method) use concat instead
      df_attributes = pd.concat([df_attributes, normalized_row], ignore_index=True) 

# Reset the index of the attributes dataframe
df_attributes = df_attributes.reset_index(drop=True)

# Drop the original attributes column
df = df.drop(columns=['attributes'])

# Join the results
df_final = df.join(df_attributes)

# Show results
print(df_final)
print(df_final.info())

Which gives me the expected result. However, as I said, there are several inefficiencies in it. For starters, the dataframe append in the for loop. According to the documentation the best practice is to make a list and then append but I could not figure out how to do that while keeping the shape I wanted. I welcome all critics and ideas.

Community
  • 1
  • 1
Wilmar
  • 558
  • 1
  • 5
  • 16
0

You can achieve the desired output in one line:

df = pd.concat([df[['ID', 'Date']], pd.json_normalize(df['attributes'])], axis=1)