Extracting portions of JSON string column containing multiple rows and columns in pandas

Question

I have a dataframe whose `parameters` column holds JSON-like strings, each of which really contains multiple rows and columns:

import pandas

input_data = pandas.DataFrame({'id':['0001','0002','0003'],
                               'parameters':["{'product':['book','cat','fish'],'person':['me','you']}",
                                             "'{'product':['book','cat'],'person':['me','you','us']}'",
                                             "'{'product':['apple','snake','rabbit','octopus'],'person':['them','you','us','we','they']}'"]})

... from which I'd like to extract the following data frames:

product_data = pandas.DataFrame({'id':['0001','0001','0001','0002','0002','0003','0003','0003','0003'],
                                'product':['book','cat','fish','book','cat','apple','snake','rabbit','octopus']})


person_data = pandas.DataFrame({'id':['0001','0001','0002','0002','0002','0003','0003','0003','0003','0003'],
                                'person':['me','you','me','you','us','them','you','us','we','they']})

Below is how I've used regular expressions to get there. I doubt this is the best way to do it, but here goes:

import re

for i in input_data.id.tolist():
    s = ''.join(input_data[input_data.id == i]['parameters'])
    product_string = re.search(r"product':(.*?),'person", str(s)).group(1)
    product_data = pandas.DataFrame(product_string[1:-1].split(','))
    person_string = re.search(r"person':(.*?)}", str(s)).group(1)
    person_data = pandas.DataFrame(person_string[1:-1].split(','))
    print("........")
    print(product_data)
    print("........")
    print(person_data)

I'd like to learn a faster, more elegant, or wholesome solution that may capture unexpected nuances.

It helps if you say *"`parameters` is a JSON string column containing multiple rows and columns in pandas"*; I edited this and tagged [tag:json]. There are tons of existing questions on extracting/parsing JSON in pandas. Really you should start from doing [`read_json`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) on your input JSON, not `read_csv`, to avoid ever having to manually extract this stuff. (Can you show us a snippet of your JSON input file? Without that link this question is not [MCVE](http://stackoverflow.com/help/mcve)) — smci, Jul 24 '19 at 22:53
Possible duplicate of [Loading a file with more than one line of JSON into Pandas](https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pandas) — smci, Jul 24 '19 at 23:01
Duplicates like [Loading a file with more than one line of JSON into Pandas](https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pythons-pandas), [this](https://stackoverflow.com/questions/20037430/reading-multiple-json-records-into-a-pandas-dataframe). [this](https://stackoverflow.com/questions/39257147/convert-pandas-dataframe-to-json-format) and [many others](https://stackoverflow.com/search?q=%5Bpandas%5D+JSON+votes%3A10) — smci, Jul 24 '19 at 23:03
Please read the [`read_json` doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) — smci, Jul 24 '19 at 23:04

score 2 · Answer 1 · answered Jul 24 '19 at 22:41


First, set up your products and persons using the `str.get` accessor:

input_data['products'] = input_data.parameters.str.get('product')

Now, for pandas >= 0.25.0, you may use the explode method

input_data.explode('products')

For pandas < 0.25.0, you may refer to this thread.

I assumed you have dictionaries in your data frame, and not strings as you showed here.

If you have strings, you can always run

import ast
input_data['parameters'] = input_data.parameters.apply(ast.literal_eval)

to make them real dictionaries.
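Putting those steps together on the sample data from the question gives something like the sketch below. It is not part of the original answer: it assumes pandas >= 0.25.0 (for `explode`), and the `str.strip("'")` cleanup is an added assumption that only holds when the sole problem is a stray pair of outer quotes, as in rows 2 and 3 here.

```python
import ast

import pandas as pd

input_data = pd.DataFrame({
    'id': ['0001', '0002', '0003'],
    'parameters': [
        "{'product':['book','cat','fish'],'person':['me','you']}",
        "'{'product':['book','cat'],'person':['me','you','us']}'",
        "'{'product':['apple','snake','rabbit','octopus'],'person':['them','you','us','we','they']}'",
    ],
})

# Assumption: the only defect is an extra quote wrapped around some strings,
# so stripping leading/trailing quotes leaves a valid Python dict literal.
cleaned = input_data['parameters'].str.strip("'")

# Turn each string into a real dictionary.
parsed = cleaned.apply(ast.literal_eval)

# Pull each list out of the dictionaries with the str.get accessor.
input_data['products'] = parsed.str.get('product')
input_data['persons'] = parsed.str.get('person')

# One row per list element, with the id repeated alongside it.
product_data = (
    input_data[['id', 'products']]
    .explode('products')
    .rename(columns={'products': 'product'})
    .reset_index(drop=True)
)
person_data = (
    input_data[['id', 'persons']]
    .explode('persons')
    .rename(columns={'persons': 'person'})
    .reset_index(drop=True)
)
```

Without the cleanup step, `ast.literal_eval` fails on rows 2 and 3, because the extra outer quotes make those strings invalid Python literals.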

answered Jul 24 '19 at 22:41

rafaelc


`input_data[['id','products']].explode('products')` – BENY Jul 24 '19 at 22:47
input_data['products'] = input_data.parameters.str.get('product') creates the column with each field being 'nan'. Is this expected? – BlackHat Jul 24 '19 at 22:52
input_data.parameters.apply(ast.literal_eval) yields invalid syntax error – BlackHat Jul 24 '19 at 22:54

Reading JSON is a job for `read_json`; recommending `ast.literal_eval` is total overkill. The OP should never have read in mangled JSON via `read_csv` (or whatever) in the first place. – smci Jul 24 '19 at 23:28

score 0 · Accepted Answer · answered Jul 24 '19 at 23:09

Given the odd structure of the strings in rows 2 and 3 (they are wrapped in an extra pair of quotes) and the desired final output, here is one version:

import pandas as pd
from ast import literal_eval

input_data = pd.DataFrame({'id':['0001','0002','0003'],
                               'parameters':["{'product':['book','cat','fish'],'person':['me','you']}",
                                             "'{'product':['book','cat'],'person':['me','you','us']}'",
                                             "'{'product':['apple','snake','rabbit','octopus'],'person':['them','you','us','we','they']}'"]})

## strip the stray outer quotes so each string is a valid dict literal
input_data['parameters'] = input_data['parameters'].str.replace("'{", '{', regex=False).str.replace("}'", '}', regex=False)
## parse each string and expand its keys into 'product' and 'person' columns
input_data = input_data.join(pd.DataFrame(input_data['parameters'].apply(literal_eval).values.tolist()))

Get the length of each list, so the ids can be repeated to match later:

products_len = input_data['product'].apply(len).values
persons_len = input_data['person'].apply(len).values

Spin off each result as a separate `df`:

## flatten the per-row lists of persons into one flat list
values = input_data['person'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]

## build a one-column dataframe from the flat list
person_df = pd.DataFrame(flat_results, columns = ['person'])


## flatten the per-row lists of products into one flat list
values = input_data['product'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]

## build a one-column dataframe from the flat list
product_df = pd.DataFrame(flat_results, columns = ['product'])

Append back the ids:

## person
ids = input_data['id'].repeat(persons_len).reset_index(drop=True)
person_df = person_df.join(ids)

## product
ids = input_data['id'].repeat(products_len).reset_index(drop=True)
product_df = product_df.join(ids)

Result

person_df
Out[57]: 
  person    id
0     me  0001
1    you  0001
2     me  0002
3    you  0002
4     us  0002
5   them  0003
6    you  0003
7     us  0003
8     we  0003
9   they  0003

product_df
Out[58]: 
   product    id
0     book  0001
1      cat  0001
2     fish  0001
3     book  0002
4      cat  0002
5    apple  0003
6    snake  0003
7   rabbit  0003
8  octopus  0003
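The same lengths-and-repeat idea can be wrapped in a small helper so it is not written out once per column. This is only a sketch, not part of the original answer: `lists_to_long` is a hypothetical name, and the sample input is assumed to be already cleaned (no stray outer quotes).

```python
from ast import literal_eval
from itertools import chain

import pandas as pd


def lists_to_long(df, id_col, list_col):
    """Return one row per list element, with the matching id repeated."""
    # Number of elements in each row's list.
    lengths = df[list_col].map(len).to_numpy()
    return pd.DataFrame({
        # Flatten the per-row lists into one flat list.
        list_col: list(chain.from_iterable(df[list_col])),
        # Repeat each id once per element of its list.
        id_col: df[id_col].repeat(lengths).to_numpy(),
    })


# Hypothetical, already-cleaned input for illustration.
raw = pd.DataFrame({
    'id': ['0001', '0002'],
    'parameters': [
        "{'product':['book','cat','fish'],'person':['me','you']}",
        "{'product':['book','cat'],'person':['me','you','us']}",
    ],
})

# Parse the strings and expand the dict keys into columns.
parsed = pd.DataFrame(raw['parameters'].apply(literal_eval).tolist())
wide = raw[['id']].join(parsed)

product_df = lists_to_long(wide, 'id', 'product')
person_df = lists_to_long(wide, 'id', 'person')
```

On pandas >= 0.25.0, `DataFrame.explode` does the same job, so a helper like this is mainly useful on older versions.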
