We have a dataframe, patents_original, whose columns consist of lists of dictionaries. Each dictionary contains recurring keys such as 'inventor_last_name'. For example, one cell of the inventors column looks like this:
[{'inventor_last_name': 'Han', 'inventor_first_name': 'Shu-Jen', 'inventor_country': 'US', 'inventor_key_id': '104654'}, {'inventor_last_name': 'Chen', 'inventor_first_name': 'Chia-Yu', 'inventor_country': 'US', 'inventor_key_id': '367934'}]
We want new columns that contain only the values of the recurring keys, so that every 'inventor_last_name' in a row ends up in a new column called 'inventor_last_name'. (When a row contains multiple inventors, all of their last names should be listed together in that one column.) For the subsequent analysis it is very important that the assignment of values to rows does not change. The new dataframe should then contain 4 new columns called 'inventor_last_name', 'inventor_first_name', 'inventor_country', and 'inventor_key_id' (the keys of the original dictionaries).
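To make the target shape concrete, here is a minimal sketch of the transformation for a single row, in plain Python. The helper name collect_keys is hypothetical, and the sketch assumes the cell really is a Python list of dictionaries rather than text.

```python
def collect_keys(inventors):
    """Turn a list of dicts into a dict of lists, one list per key."""
    collected = {}
    for inventor in inventors:
        for key, value in inventor.items():
            # Append in order, so the i-th entry of every list
            # belongs to the i-th inventor of this row.
            collected.setdefault(key, []).append(value)
    return collected


# One cell of the inventors column, copied from the example above.
row = [
    {'inventor_last_name': 'Han', 'inventor_first_name': 'Shu-Jen',
     'inventor_country': 'US', 'inventor_key_id': '104654'},
    {'inventor_last_name': 'Chen', 'inventor_first_name': 'Chia-Yu',
     'inventor_country': 'US', 'inventor_key_id': '367934'},
]

result = collect_keys(row)
print(result['inventor_last_name'])
print(result['inventor_country'])
```

Applied to every row, each of the four keys would become one new column whose cells are lists, and no row would be added, dropped, or reordered.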
Stack Overflow provided this code fragment to create a new column and fill it with the values of the key "inventor_last_name":
patents_inventors["inventor_last_name"] = [sub_dict["inventor_last_name"] for sub_dict in patents_inventors["inventors"]]
The following error occurs:
patents_inventors["inventor_last_name"] = [ sub_dict["inventor_last_name"] for sub_dict in patents_inventors["inventors"]]
Traceback (most recent call last):
File "<ipython-input-46-2c776eb8a76d>", line 1, in <module>
patents_inventors["inventor_last_name"] = [ sub_dict["inventor_last_name"] for sub_dict in patents_inventors["inventors"]]
File "<ipython-input-46-2c776eb8a76d>", line 1, in <listcomp>
patents_inventors["inventor_last_name"] = [ sub_dict["inventor_last_name"] for sub_dict in patents_inventors["inventors"]]
TypeError: string indices must be integers
As absolute Python beginners, we suspect that Python does not interpret each cell as a list of dictionaries. All our attempts to convert the column to a list or dictionary datatype have failed.
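The "string indices" wording in the error is what Python reports when a str is indexed with a key, so one possible explanation is that each cell holds the text of a list rather than an actual list. The sketch below only illustrates that assumption: the cell value is a shortened, hypothetical example, parse_cell is a made-up helper name, and ast.literal_eval from the standard library is one way to parse such text.

```python
import ast

# Hypothetical cell value: a list of dicts stored as plain text,
# as could happen after a round trip through a spreadsheet or CSV file.
cell = ("[{'inventor_last_name': 'Han', 'inventor_key_id': '104654'}, "
        "{'inventor_last_name': 'Chen', 'inventor_key_id': '367934'}]")


def parse_cell(value):
    """Return a list of dicts, parsing the value first if it is a string."""
    if isinstance(value, str):
        # literal_eval only accepts Python literals; it does not run code.
        return ast.literal_eval(value)
    return value


inventors = parse_cell(cell)

# Iterate over the dicts inside one cell, not over the cells themselves.
last_names = [inventor["inventor_last_name"] for inventor in inventors]
print(last_names)
```

Note that the comprehension runs over the dictionaries inside a single cell. The fragment quoted earlier iterates over the cells of the column, so even with real lists it would index a whole list with a key.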
If it helps, I can provide a link to the editable Excel sheet!
This is our first post on Stack Overflow, so please bear with us if anything is unclear. We really appreciate your help!
Edit:
The column names are:
patent_number
patent_title
patent_abstract
patent_date
patent_num_combined_citations
patent_num_cited_by_us_patents
inventors
assignees
applications
cited_patents
citedby_patents
cpcs
wipos
Since I am not quite sure how to shorten the dataframe to post it here, I am linking a picture of the corresponding Excel sheet. I hope this helps too.