4

I have a dataframe that looks like this:

ID       phone_numbers
1        [{u'updated_at': u'2017-12-02 15:29:54', u'created_at': u'2017-12-02 15:29:54',
          u'sms': 0, u'number': u'1112223333', u'consumer_id': 12345, u'organization_id': 1,
          u'active': 1, u'deleted_at': None, u'type': u'default', u'id': 1234}]

I want to take the phone_numbers column and flatten the information inside of it so I can query, say, the 'id' field.

When I try:

json_normalize(df.phone_numbers)

I get the error:

AttributeError: 'str' object has no attribute 'itervalues'

I am not sure why this error is being produced or why I cannot flatten this column.

EDIT:

The JSON string was originally read from a response object (r.text):

https://docs.google.com/document/d/1Iq4PMcGXWx6O48sWqqYnZjG6UMSZoXfmN1WadQLkWYM/edit?usp=sharing

EDIT:

I converted the column I need to flatten to JSON with this command:

a = df.phone_numbers.to_json()

{"0":[{"updated_at":"2018-04-12 12:24:04","created_at":"2018-04-12 12:24:04","sms":0,"number":"","consumer_id":123,"org_id":123,"active":1,"deleted_at":null,"type":"default","id":123}]}
RustyShackleford
  • One idea: is the DataFrame created from `json`? Maybe it is better to use `df = json_normalize(json)` – jezrael Jul 03 '18 at 11:50
  • I tried that since I am getting a JSON response through an API, but I am getting the same error as above – RustyShackleford Jul 03 '18 at 12:17
  • OK, is it possible to add a sample of the json? – jezrael Jul 03 '18 at 12:18
  • @jezrael, the JSON response has patient information, so there is no way I can post it; even with scrubbing I am scared something may be missed. However, the error I get when passing the json object into json_normalize is an attribute error: 'unicode' object has no attribute 'itervalues' – RustyShackleford Jul 03 '18 at 12:21
  • It looks like each record in `phone_numbers` column is a list of length 1 of JSON strings. Rather than just a JSON string. Check the exact contents to be sure. – smci May 24 '19 at 23:14

5 Answers

6

Use a list comprehension to flatten the lists and add a new ID element to each dictionary:

df = pd.DataFrame({'ID': [1, 2], 'phone_numbers': [[{'a': '2017', 'b': '2017', 'sms': 1}, 
                                                    {'a': '2018', 'b': '2017', 'sms': 2}], 
                                                  [{'a': '2017', 'b': '2017', 'sms': 3}]]})
print (df)
   ID                                      phone_numbers
0   1  [{'a': '2017', 'b': '2017', 'sms': 1}, {'a': '...
1   2             [{'a': '2017', 'b': '2017', 'sms': 3}]

df = pd.DataFrame([dict(y, ID=i) for i, x in df.values.tolist() for y in x])
print (df)  

   ID     a     b  sms
0   1  2017  2017    1
1   1  2018  2017    2
2   2  2017  2017    3
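
For comparison, a sketch of the same flattening with pd.json_normalize, using record_path and meta (assuming pandas >= 1.0, where json_normalize is exposed at the top level); the sample data is rebuilt so the snippet is self-contained:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'phone_numbers': [[{'a': '2017', 'b': '2017', 'sms': 1},
                                      {'a': '2018', 'b': '2017', 'sms': 2}],
                                     [{'a': '2017', 'b': '2017', 'sms': 3}]]})

# each row becomes one record; record_path unpacks the nested list,
# meta carries the ID over to every unpacked row
flat = pd.json_normalize(df.to_dict('records'),
                         record_path='phone_numbers', meta='ID')
print(flat)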

EDIT:

df = pd.DataFrame({'phone_numbers':{"0":[{"type":"default","id":123}]}})

df = pd.DataFrame([y for x in df['phone_numbers'].values.tolist() for y in x])
print (df) 
    id     type
0  123  default
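
Equivalently, a sketch assuming pandas >= 1.0 (for Series.explode and top-level json_normalize) that explodes the lists first and then normalizes the dicts:

import pandas as pd

# same shape as the EDIT above: one list of dicts per cell
df = pd.DataFrame({'phone_numbers': {"0": [{"type": "default", "id": 123}]}})

# explode gives one dict per row, then json_normalize spreads the keys into columns
flat = pd.json_normalize(df['phone_numbers'].explode().tolist())
print(flat)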
jezrael
  • Sorry for the delay in response. When I ran your above line, I got the error 'too many values to unpack'. I think the issue may be that the ID field is embedded in the nested json and not a separate column. Do you think that's why your code is not working for me? – RustyShackleford Jul 04 '18 at 14:59
  • After introducing the ID column, I am still getting the error 'too many values to unpack' – RustyShackleford Jul 04 '18 at 15:07
  • @RustyShackleford - Is it possible to change the input data? Best would be a sample `json` with anonymized data. – jezrael Jul 04 '18 at 15:08
  • 1
    sure I will anonymize the input data give me some time to clean it up thoroughly – RustyShackleford Jul 04 '18 at 15:13
  • I have uploaded a document to Drive. I wrote the JSON string from the API call into a Word doc and replaced all fields with the word "FIELD". Please let me know if that helps. Also, I could not find the end of one record and the beginning of the next, so it's the entire string. – RustyShackleford Jul 04 '18 at 16:41
  • @RustyShackleford - a bit of a crazy sample, thank you. I edited your json because it was not valid - some `""` were missing and the first and last characters were `"`. Also, to shrink the data, `FIELD` was replaced with `i` - [sample data](https://docs.google.com/document/d/1-h1uGPwYaAob8UlM8BrHGm7JxINOBLwECjZqYN0wNzs/edit?usp=sharing) – jezrael Jul 05 '18 at 05:54
  • @RustyShackleford - Then [this solution](https://stackoverflow.com/a/40589031/2901002) was applied, but the output DataFrame is really nested. I use `with open('A.txt') as data_file: data = json.load(data_file)` and then `df1= json_normalize(data)` or `df2 = json_normalize(data, 'iii', ['ii','iii_iiiiii'], meta_prefix='_')` or `df3 = json_normalize(data, 'iiiiiiiiii', ['ii','iii_iiiiii'], meta_prefix='_')`. Can you specify which data needs parsing, i.e. which column from this sample data? Thank you. – jezrael Jul 05 '18 at 05:57
  • And no problem with delay :) – jezrael Jul 05 '18 at 06:00
  • I think I have made a discovery. The dataframe I have been reading in is from a CSV. The nested JSON column has double quotes added to it, like this: " [{}] ". In production this will not be the case, and when I run a row of the dataframe I want to unnest through json_normalize like so (without the quotes), 'json_normalize(df.phone_numbers[0])', the row unnests. However, I cannot run this on every row in the column – RustyShackleford Jul 05 '18 at 11:51
  • @RustyShackleford - Is it possible to convert the data to json? Because I am not sure I understand :( – jezrael Jul 05 '18 at 11:59
  • i added the json data. – RustyShackleford Jul 05 '18 at 12:08
  • So `id` column is not necessary? Only need Dataframe from `df.phone_numbers` column? – jezrael Jul 05 '18 at 12:12
  • I tried pushing the phone_numbers column to a list but still get same error where json_normalize can not iterate over the list – RustyShackleford Jul 05 '18 at 12:13
  • @RustyShackleford - Can you check solution? – jezrael Jul 05 '18 at 12:25
  • @RustyShackleford - Supeeeeeeeeeeeer :) – jezrael Jul 05 '18 at 12:39
5

I am not sure, but I think json_normalize expects a dict (or a list of dicts) as its first argument, not a pd.Series. Convert the Series to a dict or a list of dicts first; you could use to_dict():

json_normalize(df.phone_numbers.to_dict())
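
For illustration, a variant of the same idea (a minimal sketch with made-up data) that hands json_normalize plain lists of dicts, one cell at a time, and concatenates the pieces:

import pandas as pd

# made-up data: each cell is a list of dicts
df = pd.DataFrame({'phone_numbers': [[{'id': 1, 'type': 'default'}],
                                     [{'id': 2, 'type': 'mobile'}]]})

# normalize each cell's list of dicts, then stitch the results back together
flat = pd.concat([pd.json_normalize(cell) for cell in df['phone_numbers']],
                 ignore_index=True)
print(flat)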
alvaro nortes
1

Team:

I think I found the answer to this one. If you dig, you'll see that the output of pd.json_normalize after the first iteration is a STRING. I had success by tricking pandas back into thinking it was a list of dictionaries.

import itertools
import pandas as pd
# each cell of "FBCustomers" is a list of dicts; chain them into one flat list
fb_customers = test_schema_a["FBCustomers"].to_list()
fb_customers_b = list(itertools.chain(*fb_customers))
test_schema_b = pd.DataFrame(fb_customers_b)
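
A self-contained sketch of the same trick with made-up data (the column name mirrors the snippet above):

import itertools
import pandas as pd

# made-up data: each cell of "FBCustomers" is a list of dicts
df = pd.DataFrame({'FBCustomers': [[{'id': 1}, {'id': 2}], [{'id': 3}]]})

# chain the per-row lists into one flat list of dicts, one row per dict
flat = pd.DataFrame(list(itertools.chain(*df['FBCustomers'].to_list())))
print(flat)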
Nic F
1

A simple one-liner to flatten a JSON column / Series in pandas:

import json
import pandas as pd

df_flattened_col = pd.json_normalize(df['JSON_COLUMN'].apply(json.loads).tolist())

⭐ Worked for me ⭐
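
A small usage sketch with made-up data, assuming the column holds JSON object strings (e.g. loaded from CSV):

import json
import pandas as pd

# made-up data: each cell is a JSON string
df = pd.DataFrame({'JSON_COLUMN': ['{"id": 123, "type": "default"}',
                                   '{"id": 456, "type": "mobile"}']})

# json.loads turns each string into a dict; json_normalize spreads the keys into columns
df_flattened_col = pd.json_normalize(df['JSON_COLUMN'].apply(json.loads).tolist())
print(df_flattened_col)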

pbjolsby
0

The easiest solution for me was to first load the file with the built-in json package, then use that object with json_normalize:

Reading from a string:

import json
from pandas import json_normalize   # pandas >= 1.0

json_str = '{ "name":"John", "age":30, "city":"New York"}'
data = json.loads(json_str)
df = json_normalize(data)
print(df)

And reading from a file:

import json
from pandas import json_normalize   # pandas >= 1.0

with open(filepath, 'r') as file:
    data = json.load(file)
    df = json_normalize(data)
    print(df)
mimoralea