
I have a pandas DataFrame of a single column containing dictionaries as elements. It is the result of the following code:

dg # is a pandas dataframe with columns ID and VALUE. Many rows contain the same ID

def seriesFeatures(series):
    """Receive the series of VALUE for a single ID, extract tens of complex
    features from it, and store them in a dictionary."""
    dico = dict()
    dico['feature1'] = calculateFeature1(series)
    dico['feature2'] = calculateFeature2(series)
    # Many more features
    dico['feature50'] = calculateFeature50(series)
    return dico

grouped = dg.groupby(['ID'])
dh = grouped['VALUE'].agg( { 'all_features' : lambda s: seriesFeatures(s) } )
dh.reset_index()
# Here I get a dh DataFrame of a single column 'all_features' and
# dictionaries stored on its values. The keys are the feature's names

I need to split this 'all_features' column into as many columns as I have features, and do so efficiently: I have a lot of rows and columns, and I can NOT change the seriesFeatures function. The output should be a dataframe with columns ID, FEATURE1, FEATURE2, FEATURE3, ... , FEATURE50. What would be the best way to do this?

EDIT

A concrete and simple example :

import pandas as pd

dg = pd.DataFrame( [ [1,10] , [1,15] , [1,13] , [2,14] , [2,16] ] , columns=['ID','VALUE'] )

def seriesFeatures(series):
    dico = dict()
    dico['feature1'] = len(series)
    dico['feature2'] = series.sum()
    return dico

grouped = dg.groupby(['ID'])
dh = grouped['VALUE'].agg( { 'all_features' : lambda s: seriesFeatures(s) } )
dh.reset_index()

But when I try to wrap the result in pd.Series or pd.DataFrame, pandas complains that an index must be provided when the data consists of scalar values. If I provide index=['feature1','feature2'], I get weird results, for instance with:

dh = grouped['VALUE'].agg( { 'all_features' : lambda s: pd.DataFrame( seriesFeatures(s) , index=['feature1','feature2'] ) } )

rafa

1 Answer

I think you should wrap the dict in a Series; the groupby call will then expand it for you. Note that you have to use apply instead of agg, since the result is no longer an aggregated (scalar) value:

dh = grouped['VALUE'].apply(lambda s: pd.Series(seriesFeatures(s)))

After that, you can reshape the result to the desired format.

With your simple example case this seems to work:

In [22]: dh = grouped['VALUE'].apply(lambda x: pd.Series(seriesFeatures(x)))
In [23]: dh

Out[23]:
ID
1   feature1     3
    feature2    38
2   feature1     2
    feature2    30
dtype: int64

In [26]: dh.unstack().reset_index()
Out[26]:
   ID  feature1  feature2
0   1         3        38
1   2         2        30
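
If the column of dictionaries already exists and cannot be regenerated, another common option (not part of the approach above, just a sketch) is to hand the list of dicts to the pd.DataFrame constructor, which turns each key into a column. The dh below is rebuilt by hand as a stand-in for the frame in the question, and the name features is only illustrative:

```python
import pandas as pd

dg = pd.DataFrame(
    [[1, 10], [1, 15], [1, 13], [2, 14], [2, 16]],
    columns=["ID", "VALUE"],
)

def seriesFeatures(series):
    # Same toy features as in the question's simple example
    return {"feature1": len(series), "feature2": series.sum()}

# Stand-in for the question's dh: one dict per ID in a single column.
# Built with an explicit loop so it does not depend on agg/apply behavior.
keys, dicts = [], []
for key, values in dg.groupby("ID")["VALUE"]:
    keys.append(key)
    dicts.append(seriesFeatures(values))
dh = pd.DataFrame({"all_features": dicts}, index=pd.Index(keys, name="ID"))

# Expand the dict column: each key becomes a column, rows stay aligned on ID
features = pd.DataFrame(dh["all_features"].tolist(), index=dh.index).reset_index()
print(features)
```

Building the frame from a plain list of dicts avoids calling pd.Series once per row, which may matter with many rows, though I have not benchmarked it against the apply/unstack route.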
joris
  • Thank you. I didn't know about this `unstack` thing, it seems to be a nice solution. – rafa Nov 05 '14 at 11:03