1

I have a list of time series (each one a pandas DataFrame) and want to calculate the matrix profile for the time series of each device. One option is to iterate over all the devices, which seems to be slow. A second option would be to group by device and apply a UDF. The problem is that the UDF returns rows 1:1, i.e. not a single scalar value per group; it outputs the same number of rows as its input.

Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?

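import pandas as pd
from IPython.display import display  # available by default in notebooks; imported so the snippet also runs as a script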
df = pd.DataFrame({
    'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)

print('***************************')
# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    
    # .copy() so that adding a column below does not trigger SettingWithCopyWarning
    this_group = df[df.bar == g].copy()
    # perform a UDF which needs to have all the values per group
    # i.e. for real I want to calculate the matrix profile for each time series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)

print('***************************')

def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x

# the neatly vectorized form would be agg() with the non-scalar function,
# but it fails with: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
Georg Heiler
  • For this, we may need to see particulars of UDF. – Parfait Nov 09 '20 at 12:54
  • Sure: https://gist.github.com/geoHeil/7344932b27f05bfaab551b3b948ac2c5 contains code that generates an example dataset and uses the `stumpy.stump` UDF. – Georg Heiler Nov 09 '20 at 13:31
  • I guess the second (non-accepted) answer at https://stackoverflow.com/questions/42171132/is-it-possible-to-do-applymap-using-the-groupby-in-pandas should work here as well; I will give it a try. – Georg Heiler Nov 09 '20 at 14:24
  • Does `stumpy.stump` return *a single scalar value*? The [docs](https://readthedocs.org/projects/stumpy/downloads/pdf/stable/) indicate it returns an `ndarray` of 4 columns. Please post example output of one call and what *single scalar value* you need to extract. – Parfait Nov 09 '20 at 15:07

2 Answers

4

For non-aggregating functions applied to each distinct group, i.e. functions that return a non-scalar value, you need to iterate the method across the groups and then combine the results.

Therefore, consider a list or dict comprehension over groupby(), followed by concat. Make sure the method takes and returns a full data frame or series; an ndarray result has to be wrapped in a pandas object first, since pd.concat only accepts Series and DataFrame objects.

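# LIST COMPREHENSION
# grouping by the column name (rather than a one-element list) keeps each group key a scalar;
# in recent pandas versions a one-element list yields tuple keys
df_list = [myfunction(sub) for key, sub in df.groupby('group_column')]
final_df = pd.concat(df_list)

# DICT COMPREHENSION
df_dict = {key: myfunction(sub) for key, sub in df.groupby('group_column')}
final_df = pd.concat(df_dict, ignore_index=True)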
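As a minimal runnable sketch of the list-comprehension pattern above, using the small frame from the question: `add_group_result` is a hypothetical stand-in for `myfunction` (it is not the matrix profile computation), chosen only because it needs the whole group and returns one row per input row.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": [1, 2, 3], "baz": [1.1, 0.5, 4], "bar": [1, 2, 1]
})

def add_group_result(sub: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical UDF: sees the whole group, returns one row per input row."""
    out = sub.copy()  # work on a copy so the group slice is not mutated
    # stand-in computation: deviation of each value from its group mean
    out["result"] = out["baz"] - out["baz"].mean()
    return out

# one processed frame per group, then stitched back together
pieces = [add_group_result(sub) for _, sub in df.groupby("bar")]
final_df = pd.concat(pieces).sort_index()  # sort_index restores the original row order
```

Because the original index is kept through `pd.concat`, the `sort_index()` call is enough to line the results up with the input rows again; pass `ignore_index=True` instead if the original order does not matter.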
Parfait
0

Indeed, this approach (see also the link in the comment above) gets it to work in a faster, more convenient way. Perhaps there is an even better alternative.

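import pandas as pd
from IPython.display import display  # available by default in notebooks; imported so the snippet also runs as a script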
df = pd.DataFrame({
    'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)

# group by the column name so that each group key is a plain scalar
grouped_df = df.groupby('bar')

altered = []
for index, subframe in grouped_df:
    print(index)
    display(subframe)
    # placeholder: apply the real UDF to subframe here;
    # as written, each group passes through unchanged
    altered.append(subframe)

# stitch the per-group results back into one frame
result = pd.concat(altered, ignore_index=True)
display(result)
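One possible alternative, sketched here under the assumption that the UDF only needs a single column per group and returns exactly one value per input row, is `groupby(...).transform`, which writes the per-group results straight back onto the original rows. `per_group_udf` is a hypothetical stand-in, not the matrix profile computation.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": [1, 2, 3], "baz": [1.1, 0.5, 4], "bar": [1, 2, 1]
})

def per_group_udf(values: pd.Series) -> pd.Series:
    """Hypothetical stand-in: receives all values of one group, returns one value per row."""
    # deviation of each value from its group mean; same length and index as the input
    return values - values.mean()

# transform calls the UDF once per group and aligns the output with the original index
df["result"] = df.groupby("bar")["baz"].transform(per_group_udf)
```

`transform` hands the UDF one column at a time, so a function that needs several columns of the group at once, or that returns a different number of rows than it received, is probably better served by the explicit loop with `pd.concat` shown above.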
Georg Heiler