Pandas Dataframe: Splitting Columns containing lists of dictionaries

Question

We have a dataframe patents_original whose columns consist of lists of dictionaries. Each dictionary contains recurring keys such as 'inventor_last_name'.

[{'inventor_last_name': 'Han', 'inventor_first_name': 'Shu-Jen', 'inventor_country': 'US', 'inventor_key_id': '104654'}, {'inventor_last_name': 'Chen', 'inventor_first_name': 'Chia-Yu', 'inventor_country': 'US', 'inventor_key_id': '367934'}]

We want new columns that contain only the values of these recurring keys, so that every 'inventor_last_name' in a row is listed in a new column called 'inventor_last_name' (when a row contains multiple inventors, all of their last names should be listed in that one column). For the subsequent analysis it is very important that the row assignment does not change. The new dataframe should then contain 4 new columns called 'inventor_last_name', 'inventor_first_name', 'inventor_country', and 'inventor_key_id' (the keys of the original dictionaries).

Stack Overflow provided this code fragment to create a new column and fill it with the values of the key "inventor_last_name":

patents_inventors["inventor_last_name"] = [sub_dict["inventor_last_name"] for sub_dict in patents_inventors["inventors"]]

The following error occurs:

patents_inventors["inventor_last_name"] = [ sub_dict["inventor_last_name"] for sub_dict in patents_inventors["inventors"]]
Traceback (most recent call last):

  File "<ipython-input-46-2c776eb8a76d>", line 1, in <module>
    patents_inventors["inventor_last_name"] = [ sub_dict["inventor_last_name"] for sub_dict in patents_inventors["inventors"]]

  File "<ipython-input-46-2c776eb8a76d>", line 1, in <listcomp>
    patents_inventors["inventor_last_name"] = [ sub_dict["inventor_last_name"] for sub_dict in patents_inventors["inventors"]]

TypeError: string indices must be integers

To us, as absolute Python amateurs, it seems like Python does not interpret the cell as a list of dictionaries. All attempts to convert the column to the datatype list or dictionary have failed.
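
One way to check what the cells actually hold is to count the Python types in the column, since the dtype `object` alone does not distinguish a list from a string. A minimal sketch on a made-up two-row frame (the name `demo` and its contents are assumptions, not the real patent data):

```python
import pandas as pd

# Hypothetical sample: one cell is a real list of dicts, the other is the
# same structure stored as text (as can happen after a CSV/Excel round trip).
demo = pd.DataFrame({
    "inventors": [
        [{"inventor_last_name": "Han"}],
        "[{'inventor_last_name': 'Chen'}]",
    ]
})

# Count the Python type of every cell in the column.
type_counts = demo["inventors"].map(type).value_counts()
print(type_counts)

# Iterating over a string yields single characters, and indexing a character
# with a key raises the TypeError shown in the traceback.
try:
    [sub["inventor_last_name"] for sub in demo["inventors"].iloc[1]]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

If the count shows `str` rather than `list`, the cells need to be parsed before any key can be looked up.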

If it helps, I can provide you with the link to the editable Excel sheet!

This is our first post on Stack Overflow, so please bear with us if it reads like one. We really appreciate your help!

Edit:

The column names are:

patent_number
patent_title
patent_abstract
patent_date
patent_num_combined_citations
patent_num_cited_by_us_patents
inventors
assignees
applications
cited_patents
citedby_patents
cpcs
wipos

Since I am not quite sure how to shorten the dataframe to post it here, I am linking a picture of the corresponding Excel sheet. Hope this helps too.

[Image: excerpt of the corresponding Excel sheet]

It's not clear how the dataframe is actually structured. Please clarify, add a little sample, etc. — Timus, Jul 29 '21 at 10:05
The dataframe itself has the shape `[36015 rows x 14 columns]` and contains patent data mainly in the form of objects. Since the other columns are structured in a similar way to the inventors column, this column is representative of the rest of the df. Hope this helps! — Patrick Müller, Jul 29 '21 at 10:22
What are the column names? It would be best if you could create a small artificial sample that replicates the problem. Make it copy-paste-able. — Timus, Jul 29 '21 at 10:23
Hi Timus, I am not familiar with posting a df on Stack Overflow, and since the df is that big, https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples could not help me that much. I edited the question, hope that helps. Thank you in advance! — Patrick Müller, Jul 29 '21 at 10:53

Joeri Beckers · Answer 1 · 2021-07-29T12:03:26.023

I personally prefer using apply in these situations because it increases readability. From your question I was not sure if it was a nested list.

Update: The string type means that you first need to parse the object as a dictionary.

import json
patents_inventors['parsed_inventors'] = patents_inventors['inventors'].apply(lambda x: [json.loads(y) for y in x])

patents_inventors['inventor_last_name'] = patents_inventors['parsed_inventors'].apply(lambda x: [y['inventor_last_name'] for y in x])
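
If the cells turn out to be strings that look like Python literals (single-quoted keys, as in the sample row from the question), `json.loads` would reject them, and `ast.literal_eval` is one alternative. The sketch below assumes each cell is either such a string or already a list of dictionaries. The helper `to_records`, the sample frame, and the loop over the four keys are illustrative and not part of the original answer.

```python
import ast

import pandas as pd

# Hypothetical two-row sample modelled on the row shown in the question.
patents_inventors = pd.DataFrame({
    "patent_number": ["1", "2"],
    "inventors": [
        "[{'inventor_last_name': 'Han', 'inventor_first_name': 'Shu-Jen', "
        "'inventor_country': 'US', 'inventor_key_id': '104654'}, "
        "{'inventor_last_name': 'Chen', 'inventor_first_name': 'Chia-Yu', "
        "'inventor_country': 'US', 'inventor_key_id': '367934'}]",
        [{"inventor_last_name": "Doe", "inventor_first_name": "Jane",
          "inventor_country": "DE", "inventor_key_id": "1"}],
    ],
})


def to_records(cell):
    """Return a list of dicts, parsing the cell only if it is a string."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return cell


parsed = patents_inventors["inventors"].apply(to_records)

keys = ["inventor_last_name", "inventor_first_name",
        "inventor_country", "inventor_key_id"]
for key in keys:
    # One list per row, so the row order and index stay unchanged.
    patents_inventors[key] = parsed.apply(
        lambda records, k=key: [rec.get(k) for rec in records]
    )

print(patents_inventors[["patent_number"] + keys])
```

Because every new column is built row by row from the same Series, each list of names stays attached to the patent it came from.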

edited Jul 29 '21 at 12:03

answered Jul 29 '21 at 10:13

Both code fragments result in the same error: `TypeError: string indices must be integers`. I guess one of our main problems is that we are not quite sure about the column's structure. `.dtypes` reveals that it is an object, but since we are quite new to Python we are not sure how to handle this information and can't convert it to a different type like a list. – Patrick Müller Jul 29 '21 at 10:18
Check the type of the object because maybe it's a string and you have to parse it first – Joeri Beckers Jul 29 '21 at 10:24
This is what the output of the type function looks like: `[, , , , ... , ]`. I am not quite sure how to interpret this information. For your information, the dataframe was received via an API request from PatentsView. – Patrick Müller Jul 29 '21 at 10:56
Dear Joeri, the problem has changed once more. The new error is `TypeError: the JSON object must be str, bytes or bytearray, not dict`, which appeared after we changed the JSON decoding from `results = r.json` to `results = json.loads(r.text)`. Do you have any idea how to handle this problem? Thank you for your help! – Patrick Müller Aug 03 '21 at 10:52
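
The later error message suggests that, once the whole response has been decoded with `json.loads(r.text)`, the nested values are already Python dictionaries, so a second `json.loads` on each element is what fails. A minimal sketch with a hand-written response body (the `patents` key and the sample payload are assumptions about the API response, not taken from the thread):

```python
import json

# Hypothetical response text standing in for r.text.
response_text = json.dumps({
    "patents": [
        {"patent_number": "1",
         "inventors": [{"inventor_last_name": "Han"},
                       {"inventor_last_name": "Chen"}]},
    ]
})

# Decoding once converts the whole payload, including nested objects.
results = json.loads(response_text)
inventors = results["patents"][0]["inventors"]

# The elements are already dicts, so decoding them again raises TypeError.
try:
    json.loads(inventors[0])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Direct key access works without any further parsing.
last_names = [inv["inventor_last_name"] for inv in inventors]
print(last_names)
```

Under that assumption, the `json.loads(y)` step can be dropped and the list comprehension over the keys applied to the column directly.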
