0

I have a column authors in a dataframe df where the values are first in a dictionary and then the dictionary is added to a list. The list in then stored in a column. Such as the following:

[{'family': 'Yaisy',
  'given': 'Lisa',
  'affiliation': [{'name': 'Department of Sciences, Faculty Sciences, University of Science'}]},
 {'family': 'Kite',
  'given': 'Hume',
  'affiliation': [{'name': 'Department of Sciences, Science and Technology'}]},
 {'family': 'Jones',
  'given name': 'Mike',
  'localId': 'aza',
  'affiliation': [{'name': 'Department of Health, Science and Technology'}]},
 {'family': 'abc',
  'given name': 'xyz',
  'affiliation': [{'name': 'Health Sciences, University of Science'}]}]

I want to break this list into different columns with the keys as column names and the values as columns values. Since the keys have duplicate names, it is ok for me to just add 1,2,3 as suffix for each column name. I tried the solution suggested in this question

df.join(pd.json_normalize(df.authors))

However, I first need to change the list into a simple dictionary and then use the above the solution. So I tried flattening the list to get the dictionaries and store them in the same column with:

df.author = [y for x in df.author for y in x]

But here I get the error that length of values does not match length of index.

Can anyone kindly help me in solving this issue? Thanks!

Hanif
  • 377
  • 4
  • 19

1 Answers1

0

One approach is to build records-list manually and the create a dataframe:

lst = [{'family': 'Yaisy',
  'given': 'Lisa',
  'affiliation': [{'name': 'Department of Sciences, Faculty Sciences, University of Science'}]},
 {'family': 'Kite',
  'given': 'Hume',
  'affiliation': [{'name': 'Department of Sciences, Science and Technology'}]},
 {'family': 'Jones',
  'given name': 'Mike',
  'localId': 'aza',
  'affiliation': [{'name': 'Department of Health, Science and Technology'}]},
 {'family': 'abc',
  'given name': 'xyz',
  'affiliation': [{'name': 'Health Sciences, University of Science'}]}]

records = []
for item in lst:
    records.append({
        'family': item['family'],
        'given': item.get('given', item.get('given name')),
        **item['affiliation'][0]
    })

df = pd.DataFrame(records)
print(df)

Prints:

  family given                                               name
0  Yaisy  Lisa  Department of Sciences, Faculty Sciences, Univ...
1   Kite  Hume     Department of Sciences, Science and Technology
2  Jones  Mike       Department of Health, Science and Technology
3    abc   xyz             Health Sciences, University of Science
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • This dictionary and list was an example of one cell from my dataframe. But I have a full column in my df where each cell has these sort of dictionaries and lists. Therefore, i will need a solution that can be applied to the full column, perhaps cell by cell. – Hanif Oct 20 '20 at 08:26
  • @Hanif Then you can do `for item in df["my_column"]:` instead of `for item in lst:` – Andrej Kesely Oct 20 '20 at 09:07
  • If i try, I get error: TypeError: list indices must be integers or slices, not str – Hanif Oct 20 '20 at 09:49
  • @Hanif It means you have some other structure in your column. Without looking at real data is hard to figure it out... – Andrej Kesely Oct 20 '20 at 10:31
  • unfortunately i cannot share all the dataframe because it has personal data in it and also there are more than 7000 rows. That's why need a solution where I do not have to manually build the dataframe with specific columns/keys. – Hanif Oct 20 '20 at 10:41