
I was reading "Resample a dataframe with different functions applied to each column?"

The solution was:

frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})

Suppose I want to add a non-existing column to the result that stores the value of some other function, say count(). In the example given, suppose I want to compute the number of rows in each 1H period.

Is it possible to do:

frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean,
               'new_column': count()})

Note, new_column is NOT an existing column in the original data frame.

The reason I ask is that I have a very large data frame, and I don't want to resample the original df twice just to get the count in each resample period.

I'm trying the above right now and it seems to be taking a very long time (no syntax errors). I'm not sure if Python is stuck in some sort of infinite loop.

Update:

I implemented the suggestion to use agg (thank you kindly for that).

However, I received the following error when computing the first aggregator:

grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1'][grouped['col1'] > 0].agg('sum')
     ...


/Users/blahblah/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in __getitem__(self, key)
    521 
    522     def __getitem__(self, key):
--> 523         raise NotImplementedError('Not implemented: %s' % key)
    524 
    525     def _make_wrapper(self, name):

NotImplementedError: Not implemented: True

The following works when I use grouped.apply(foo).

new_col1 = grp['col1'][grp['col1'] > 0].sum()
codingknob

1 Answer


Resampling is similar to grouping with a TimeGrouper. While resample's how parameter only lets you specify one aggregator per column, the GroupBy object returned by df.groupby(...) has an agg method that can be passed various functions (e.g. mean, sum, or count) to aggregate the groups in various ways. You can use those results to build the desired DataFrame:

import datetime as DT
import numpy as np
import pandas as pd
np.random.seed(2016)

date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
df = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                  index=date_times)

resampled = df.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
print(resampled[['radiation', 'tamb']])
#                       radiation      tamb
# 2012-04-05 08:00:00  279.432788  4.549235
# 2012-04-05 09:00:00  310.032188  4.414302
# 2012-04-05 10:00:00  257.504226  5.056613
# 2012-04-05 11:00:00  299.594032  4.652067
# 2012-04-05 12:00:00    8.109946  7.795668
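As an aside for readers on newer pandas: the how argument to resample was deprecated in pandas 0.18, and the same result is obtained by calling .agg on the resampler. A sketch, assuming a reasonably recent pandas; passing a list of aggregators for one column also answers the original question (count without needing an existing column):

```python
import datetime as DT
import numpy as np
import pandas as pd

np.random.seed(2016)
date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
df = pd.DataFrame({'tamb': np.random.sample(date_times.size) * 10.0,
                   'radiation': np.random.sample(date_times.size) * 10.0},
                  index=date_times)

# One aggregator per column, plus a count, in a single pass.
# The result has MultiIndex columns like ('tamb', 'count').
resampled = df.resample('1H').agg({'radiation': 'sum',
                                   'tamb': ['mean', 'count']})
```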

def using_agg(df):
    grouped = df.groupby(pd.TimeGrouper('1H'))
    return pd.DataFrame(
        {'radiation': grouped['radiation'].agg('sum'), 
         'tamb': grouped['tamb'].agg('mean'), 
         'new_column': grouped['tamb'].agg('count')})

print(using_agg(df))

yields

                     new_column   radiation      tamb
2012-04-05 08:00:00          60  279.432788  4.549235
2012-04-05 09:00:00          60  310.032188  4.414302
2012-04-05 10:00:00          60  257.504226  5.056613
2012-04-05 11:00:00          60  299.594032  4.652067
2012-04-05 12:00:00           1    8.109946  7.795668

Note that my first answer suggested using groupby/apply:

def using_apply(df):
    grouped = df.groupby(pd.TimeGrouper('1H'))
    result = grouped.apply(foo).unstack(-1)
    result = result.sortlevel(axis=1)
    return result[['radiation', 'tamb', 'new_column']]

def foo(grp):
    radiation = grp['radiation'].sum()
    tamb = grp['tamb'].mean()
    cnt = grp['tamb'].count()
    return pd.Series([radiation, tamb, cnt], index=['radiation', 'tamb', 'new_column'])

It turns out that using apply here is much slower than using agg. If we benchmark using_agg versus using_apply on a 1681-row DataFrame:

np.random.seed(2016)

date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 6, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
df = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                  index=date_times)

I find, using IPython's %timeit magic:

In [83]: %timeit using_apply(df)
100 loops, best of 3: 16.9 ms per loop

In [84]: %timeit using_agg(df)
1000 loops, best of 3: 1.62 ms per loop

using_agg is significantly faster than using_apply and (based on additional %timeit tests) the speed advantage in favor of using_agg grows as len(df) grows.


By the way, regarding

frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean,
               'new_column': count()})

besides the problem that the how dict does not accept non-existent column names, the parentheses in count() are problematic. The values in the how dict should be function objects; count is a function object, but count() is the value returned by calling count.

Since Python evaluates arguments before calling functions, count() is getting called before frame.resample(...), and the return value of count() is then associated with the key 'new_column' in the dict bound to the how parameter. That's not what you want.
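A tiny self-contained illustration of the difference (the count function here is hypothetical, just for demonstration):

```python
def count():
    """Hypothetical stand-in aggregator, just for illustration."""
    return 42

# count (no parentheses) is the function object itself;
# count() calls the function immediately and yields its return value.
how_right = {'new_column': count}    # stores the function
how_wrong = {'new_column': count()}  # count() runs here; stores 42
```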


Regarding the updated question: Precompute the values that you will need before calling groupby/agg:

Instead of

grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1'][grouped['col1'] > 0].agg('sum')
     ...
# raises NotImplementedError, since boolean-mask indexing of
# `grouped['col1']` is not supported

use

df['col1_pos'] = df['col1'].clip(lower=0)
grouped = df.groupby(['name1',pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1_pos'].agg('sum')
    ...
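For the follow-up in the comments (summing positive and negative values in separate columns), the same precompute-then-agg pattern works. A sketch with a small hypothetical frame standing in for the poster's data, using pd.Grouper, the modern spelling of TimeGrouper:

```python
import pandas as pd

# Hypothetical data standing in for the poster's df.
df = pd.DataFrame({'name1': ['a', 'a', 'b', 'b'],
                   'col1': [1.0, -2.0, 3.0, -4.0]},
                  index=pd.to_datetime(['2016-01-05', '2016-01-20',
                                        '2016-01-07', '2016-02-01']))

# Precompute the positive and negative parts once...
df['col1_pos'] = df['col1'].clip(lower=0)
df['col1_neg'] = df['col1'].clip(upper=0)

# ...then a single groupby/agg sums each part in one fast pass.
grouped = df.groupby(['name1', pd.Grouper(freq='M')])
result = grouped[['col1_pos', 'col1_neg']].agg('sum')
```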

See the bottom of this post for more on why pre-computation helps performance.

unutbu
  • Thank you for the detailed and thoughtful explanation. I truly appreciate it. The answer taught me a lot. Thank you. – codingknob Mar 30 '16 at 14:31
  • Question: is there a difference between how each of the per-column aggregator operations is performed and the sequential computation of aggregators in foo(grp)? In my case I have 30 aggregators in foo(grp), which takes a long time because it iterates through one aggregator at a time over the same set of groups. I'm not sure if how= works the same way. I am wondering if there is a way to vectorize the computation for speed? – codingknob Mar 30 '16 at 22:23
  • Indeed, I think my first suggestion to use `apply` was not a good one. There is a much faster way: use `groupby/agg` instead of `groupby/apply`. I've edited my post to show what I mean. – unutbu Mar 31 '16 at 00:14
  • Wow. Thank you for the great suggestion to use agg. I appreciate it. It seems I've run into another problem with the way I compute aggregators. I want to compute .agg('sum') for positive and negative values in separate columns. I was able to do this when using the apply(foo) method (but that is not optimal, as you know). – codingknob Mar 31 '16 at 18:53
  • You may need to precompute new columns, with just the positive values in one column and the negative values in another column. Then you can apply `groupby/agg('sum')` to these new columns. – unutbu Mar 31 '16 at 19:02
  • Your updated solution works amazingly well. It is lightning fast compared to what I had before. Thank you. – codingknob Apr 01 '16 at 03:02