0

This is sort of an extension on a previous question I asked, but different scope and approach.

I have a dataframe with a column populated by lists of dictionaries in each row

0    [{"date":"0 1 0" firstBoxerRating:[null null] ...
1    [{"date":"2 2 1" firstBoxerRating:[null null] ...
2    [{"date":"2013-10-05" firstBoxerRating:[null n...

This is short sample of some of the info In a given row:

[{"date":"2 2 1" firstBoxerRating:[null null] firstBoxerWeight:201.75 judges:[{"id":404749 name:"David Hudson" scorecard:[]} {"id":477070 name:"Mark Philips" scorecard:[]} {"id":404277 name:"Oren Shellenberger" scorecard:[]}] links:{"bio":1346666 bout:"558867/1346666" event:558867 other:[]} location:"Vanderbilt University Memorial Gymnasium Nashville" metadata:" time: 2:54\n | <span>referee:</span> <a href=\"/en/referee/403887\">Anthony Bryant</a><span> | </span><a href=\"/en/judge/404749\">David Hudson</a> | <a href=\"/en/judge/477070\">Mark Philips</a> 

I would like to create a clean dataframe where the key in the dictionary becomes the column and the value, the row related to the particular column.

So here is an example of my desired output using the short sample as the input data:

date   firstBoxerRating  firstBoxerWeight judges  id.......
2 2 1    [null null]          201.75              404749.....

I do not believe the question is a duplicate of this

Have tried every solution in this question, my data also contains lists of nested dictionaries, if anything resembling a json

For example, this solution:

pd.DataFrame.from_dict({(i,j): df[i][j] 
                           for i in df.keys() 
                           for j in df[i].keys()},
                       orient='index')

produces the exact same output I have

I have also tried unpacking the dicts in the column:

df[0].apply(pd.Series)

However, again this produces the same output

Emm
  • 2,367
  • 3
  • 24
  • 50
  • Possible duplicate of [Construct pandas DataFrame from items in nested dictionary](https://stackoverflow.com/questions/13575090/construct-pandas-dataframe-from-items-in-nested-dictionary) – abhilb Nov 07 '19 at 09:51
  • @abhilb its a list of nested dictionaries though – Emm Nov 07 '19 at 09:52
  • 1
    Can you provide an expected output - I'm finding it confusing on what you need :) also... the sample data doesn't look to be a valid list, dictinoary or JSON object. I don't see any commas separating anything. Is this a giant string? – MattR Nov 07 '19 at 15:47
  • @MattR so it seems to resemble either a json or rather a list of dictionaries. I have updated the desired output to give you more clarity – Emm Nov 07 '19 at 15:49
  • It may *resemble* one. but it's actually not a valid format for either :/ Dictionaries need to be like `{key:value, other_key:other_value}`. And lists need to be `[1,2,3,4,5]`. But there are no commas in between any of your data. I think there needs to be some data cleanup or extracted in a different way before you can solve the dataframe piece! – MattR Nov 07 '19 at 15:55

1 Answers1

0

Managed to resolve this issue with using regex and str.extract.

I extract the text between two strings and append said text to its relevant column

Example:

df[0].str.extract('date(?P<date>.*?)firstBoxerRating(?P<firstBoxerRating>.*?)firstBoxerWeight(?P<firstBoxerWeight>.*?)judges(?P<JudgeID>.*?)links(?P<Links>.*?)location(?P<location>.*?)metadata(?P<metadata>.*?)')
Emm
  • 2,367
  • 3
  • 24
  • 50