1

I've got a dataframe that contains a mostly NaN's, but also dictionaries in certain entries. My goal is expanding those dictionaries to columns of the dataframe and keeping their entries on their respective indices. This is what a small part of the dataframe looks like.

                                                                       _id  _score 
query                                                        
chrM:g.146T>C                                                            NaN  NaN  
chrM:g.11723C>T                                                          NaN  NaN                                       NaN  NaN   
chrM:g.11813A>G                                                          NaN  NaN   
chrM:g.12140T>A                                                          NaN  NaN   
...                                                                      ...  ...     
chr1:g.11976370T>G         {u'ref': u'T', u'alleles': [{u'allele': u'T', ...  NaN   
chr1:g.12007164A>G                                                       NaN  NaN   
chr1:g.12007165A>G                                                       NaN  NaN 

So far, I've just managed to pick the keys of each dict and add columns named with those keys:

s ={}
for cols in cols:
    if type(data1[cols].dropna()[0]) == type(s):
       cols_var = np.array(data1[cols].dropna()[0].keys())
       data1 = pandas.concat([data1,pandas.DataFrame(columns=cols_var)])

Any help or hints on how to do so in an efficient and readable will be much appreciated.

**EDIT: ** this code:

y = pandas.Series((dbsnp.iloc[0]))
print y

however retrieves something somewhat useful:

allele_origin                   unspecified
alleles                         [{u'allele': u'G'}, {u'allele': u'A'}]      
alt                              A
...                              ...
rsid                             rs201327123
vartype                          snp
dtype: object

I'll try working from here, other input is much appreciated.

Carlo Mazzaferro
  • 838
  • 11
  • 21
  • Can you show the full dictionary? It looks like it's nested. How do you want to expand it exactly? – IanS Apr 26 '16 at 07:50
  • You are correct. It is nested, and other columns as well may contain nested dictionaries. Here's that specific one: `{u'ref': u'G', u'alleles': [{u'allele': u'G'}, {u'allele': u'A'}], u'var_subtype': u'ts', u'allele_origin': u'unspecified', u'dbsnp_build': 137, u'rsid': u'rs201327123', u'flags': [u'ASP', u'R3'], u'hg19': {u'start': 14677, u'end': 14678}, u'alt': u'A', u'validated': True, u'chrom': u'1', u'class': u'SNV', u'vartype': u'snp'}` – Carlo Mazzaferro Apr 26 '16 at 08:44
  • Can you check my solution and see if it works for you? – IanS Apr 26 '16 at 08:46
  • I get the following error: `() got an unexpected keyword argument 'axis'`, which seems to be explained here: [http://stackoverflow.com/questions/29155310/trouble-passing-in-lambda-to-apply-for-pandas-dataframe]. Suggestions? – Carlo Mazzaferro Apr 26 '16 at 09:14
  • Sorry, I originally used a single-column dataframe instead of a series. I edited my answer (and the code is now cleaner). – IanS Apr 26 '16 at 09:37

1 Answers1

4

In order to expand the dictionary into a dataframe with multiple columns, you should apply a function that returns the dictionary as a pandas series. In order to do that, you have to remove NaN values first.

Let's start with a single column: s = data1['_id'].dropna(). The following will return an expanded dataframe made from the dictionaries:

expanded_df = s.apply(lambda row: pandas.Series(row))

This makes use of a lambda (i.e. inline) function that transforms the dictionary into a series. You can now merge the expanded dataframe with the original one:

pandas.concat([data1, expanded_df], axis=1)

The concat function will match the rows by index, and put NaN values for missing indices in expanded_df. Conveniently, those will be precisely the rows you removed with dropna in the first step.

You can now do this for all columns in data1 by iterating on the columns like you did originally, or by using apply on data1.

IanS
  • 15,771
  • 9
  • 60
  • 84
  • It worked! Just a couple of details: the nested dictionaries still remained as dictionaries; so I think I can run another run of such function to them after identifying them. Lastly, I see that the columns that contained dicts are still there so I'll have just to drop them. But besides that, looks good. Thank you. – Carlo Mazzaferro Apr 26 '16 at 09:47
  • Yes, I'm afraid repeating the procedure is the only way to deal with nested dictionaries. Either that or deal with them inside the lambda function. I can have a try later today... – IanS Apr 26 '16 at 09:56
  • Definitely. I tried up-voting it yesterday but I'm not able to do so since I've still got to reach 15 points of reputation xD. I'll also try repeating the method, will get you back. Thanks once more. – Carlo Mazzaferro Apr 26 '16 at 18:21
  • 1
    You could try to flatten the dictionary first: http://stackoverflow.com/questions/6027558/flatten-nested-python-dictionaries-compressing-keys – IanS Apr 27 '16 at 08:27