I have a pandas DataFrame with a single column whose elements are dictionaries. It is the result of the following code:
dg # is a pandas dataframe with columns ID and VALUE. Many rows contain the same ID
def seriesFeatures(series):
    """This function receives the series of VALUE for a single ID and extracts
    tens of complex features from it, storing them in a dictionary."""
    dico = dict()
    dico['feature1'] = calculateFeature1
    dico['feature2'] = calculateFeature2
    # Many more features
    dico['feature50'] = calculateFeature50
    return dico
grouped = dg.groupby(['ID'])
dh = grouped['VALUE'].agg( { 'all_features' : lambda s: seriesFeatures(s) } )
dh.reset_index()
# Here I get a DataFrame dh with a single column 'all_features' whose
# values are dictionaries. The keys are the feature names.
I need to split this 'all_features' column into as many columns as I have features, and do so efficiently (I have a large number of rows and columns, and I can NOT change the seriesFeatures function). The output should be a dataframe with columns ID, FEATURE1, FEATURE2, FEATURE3, ..., FEATURE50. What would be the best way to do so?
EDIT
A concrete and simple example:
import pandas as pd

dg = pd.DataFrame( [ [1,10] , [1,15] , [1,13] , [2,14] , [2,16] ] , columns=['ID','VALUE'] )
def seriesFeatures(series):
    dico = dict()
    dico['feature1'] = len(series)   # number of rows for this ID
    dico['feature2'] = series.sum()  # sum of VALUE for this ID
    return dico
grouped = dg.groupby(['ID'])
dh = grouped['VALUE'].agg( { 'all_features' : lambda s: seriesFeatures(s) } )
dh.reset_index()
But when I try to wrap the result in pd.Series or pd.DataFrame, pandas complains that if the data is a scalar value, an index must be provided. When I provide index=['feature1','feature2'], I get weird results, for instance with:
dh = grouped['VALUE'].agg( { 'all_features' : lambda s: pd.DataFrame( seriesFeatures(s) , index=['feature1','feature2'] ) } )
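To make the desired output concrete, here is a minimal sketch of one way the split could look on the small example. It assumes every dictionary has the same keys and that the list of dictionaries fits in memory. The loop that builds dh is only a stand-in for the agg call above (newer pandas versions reject dict-style renaming in agg), and the names ids, dicts, expanded and result are hypothetical. seriesFeatures itself is unchanged in behavior.

```python
import pandas as pd

dg = pd.DataFrame([[1, 10], [1, 15], [1, 13], [2, 14], [2, 16]],
                  columns=['ID', 'VALUE'])

def seriesFeatures(series):
    # Same behavior as the example function: one dictionary per group.
    return {'feature1': len(series), 'feature2': series.sum()}

# Stand-in for the agg call: collect one dictionary per ID
# into a single object column named 'all_features'.
ids, dicts = [], []
for key, values in dg.groupby('ID')['VALUE']:
    ids.append(key)
    dicts.append(seriesFeatures(values))
dh = pd.DataFrame({'ID': ids, 'all_features': dicts})

# Expand the column: a list of dictionaries becomes one column per key.
expanded = pd.DataFrame(dh['all_features'].tolist(), index=dh.index)

# Put ID back next to the feature columns.
result = pd.concat([dh[['ID']], expanded], axis=1)
print(result)
```

The expansion step relies only on the DataFrame constructor accepting a list of dictionaries, so it does not require touching seriesFeatures. Whether it is fast enough for tens of features and many rows is something I have not measured.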