48

I have a times series with temperature and radiation in a pandas dataframe. The time resolution is 1 minute in regular steps.

import datetime
import pandas as pd
import numpy as np

date_times = pd.date_range(datetime.datetime(2012, 4, 5, 8, 0),
                           datetime.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
frame = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                     index=date_times)
frame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 241 entries, 2012-04-05 08:00:00 to 2012-04-05 12:00:00
Freq: T
Data columns:
radiation    241  non-null values
tamb         241  non-null values
dtypes: float64(2)

How can I down-sample this dataframe to a resolution of one hour, computing the hourly mean for the temperature and the hourly sum for radiation?

piRSquared
  • 285,575
  • 57
  • 475
  • 624
bmu
  • 35,119
  • 13
  • 91
  • 108

4 Answers4

77

With pandas 0.18 the resample API changed (see the docs). So for pandas >= 0.18 the answer is:

In [31]: frame.resample('1H').agg({'radiation': np.sum, 'tamb': np.mean})
Out[31]: 
                         tamb   radiation
2012-04-05 08:00:00  5.161235  279.507182
2012-04-05 09:00:00  4.968145  290.941073
2012-04-05 10:00:00  4.478531  317.678285
2012-04-05 11:00:00  4.706206  335.258633
2012-04-05 12:00:00  2.457873    8.655838

Old Answer:

I am answering my question to reflect the time series related changes in pandas >= 0.8 (all other answers are outdated).

Using pandas >= 0.8 the answer is:

In [30]: frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
Out[30]: 
                         tamb   radiation
2012-04-05 08:00:00  5.161235  279.507182
2012-04-05 09:00:00  4.968145  290.941073
2012-04-05 10:00:00  4.478531  317.678285
2012-04-05 11:00:00  4.706206  335.258633
2012-04-05 12:00:00  2.457873    8.655838
bmu
  • 35,119
  • 13
  • 91
  • 108
  • 7
    This can be extended to a list of functions per column: `frame.resample('1H', how={'radiation': [np.sum, np.min], 'tamb': np.mean})`. The resulting DataFrame has a MultiIndex on its columns, with the original column name as level 0 and the function name as level 1. – Def_Os May 26 '15 at 23:36
  • 3
    To add to my previous comment: instead of a list of functions per column, you can also use a dictionary, where the key is the new column name and the value is the function to use: `frame.resample('1H', how={'radiation': {'sum_rad': np.sum, 'min_rad': np.min}, 'tamb': np.mean})` – Def_Os May 27 '15 at 20:27
  • say if you want to add a NEW column into the result, such as count() of each row in the resample period. – codingknob Mar 29 '16 at 19:45
  • @codingknob: Sorry, I don't undestand your comment. – bmu Apr 08 '16 at 12:24
  • @bmu , what if I want to aggregate by a custom function that combines data from two columns. For instance, what if I want to aggregate by np.mean(frame['radiation'] * frame['tamb']) ? – Amitai Jul 23 '17 at 12:37
  • 1
    what you will advise in case of .bfill()? np.bfill() doesn't exist – Catherine Nosova Jan 31 '20 at 11:20
  • frame.resample() no longer accepts 'how' for pandas >1 – Rich Andrews Feb 15 '21 at 21:52
  • @CatherineNosova I'm afraid the solution based on `agg` only works for downsampling use cases (where there is something to aggregate). This answer doesn't seem to address the upsampling case, as also [discussed in this question](https://stackoverflow.com/questions/65116045/incomplete-filling-when-upsampling-with-agg-for-multiple-columns-pandas-resam). – bluenote10 Oct 07 '21 at 20:14
3

You can also downsample using the asof method of pandas.DateRange objects.

In [21]: hourly = pd.DateRange(datetime.datetime(2012, 4, 5, 8, 0),
...                          datetime.datetime(2012, 4, 5, 12, 0),
...                          offset=pd.datetools.Hour())

In [22]: frame.groupby(hourly.asof).size()
Out[22]: 
key_0
2012-04-05 08:00:00    60
2012-04-05 09:00:00    60
2012-04-05 10:00:00    60
2012-04-05 11:00:00    60
2012-04-05 12:00:00    1
In [23]: frame.groupby(hourly.asof).agg({'radiation': np.sum, 'tamb': np.mean})
Out[23]: 
                     radiation  tamb 
key_0                                
2012-04-05 08:00:00  271.54     4.491
2012-04-05 09:00:00  266.18     5.253
2012-04-05 10:00:00  292.35     4.959
2012-04-05 11:00:00  283.00     5.489
2012-04-05 12:00:00  0.5414     9.532
Garrett
  • 47,045
  • 6
  • 61
  • 50
3

To tantalize you, in pandas 0.8.0 (under heavy development in the timeseries branch on GitHub), you'll be able to do:

In [5]: frame.convert('1h', how='mean')
Out[5]: 
                     radiation      tamb
2012-04-05 08:00:00   7.840989  8.446109
2012-04-05 09:00:00   4.898935  5.459221
2012-04-05 10:00:00   5.227741  4.660849
2012-04-05 11:00:00   4.689270  5.321398
2012-04-05 12:00:00   4.956994  5.093980

The above mentioned methods are the right strategy with the current production version of pandas.

Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
  • Thanks, but what I want to have would be something like `frame.convert('1h', how={'radiation': 'sum, 'tamb': 'mean'})`. Is this an option in 0.8? – bmu Apr 08 '12 at 10:30
  • @ Wes McKinney this should be `resample` in 0.8, isn't it? – bmu Jun 15 '12 at 20:58
  • 1
    If you would update your answer, I would accept it. otherwise you should remove it I think, because it will point users to the wrong direction. – bmu Oct 13 '12 at 08:23
1

You need to use groupby as such:

grouped = frame.groupby(lambda x: x.hour)
grouped.agg({'radiation': np.sum, 'tamb': np.mean})
# Same as: grouped.agg({'radiation': 'sum', 'tamb': 'mean'})

with the output being:

        radiation      tamb
key_0                      
8      298.581107  4.883806
9      311.176148  4.983705
10     315.531527  5.343057
11     288.013876  6.022002
12       5.527616  8.507670

So in essence I am splitting on the hour value and then calculating the mean of tamb and the sum of radiation and returning back the DataFrame (similar approach to R's ddply). For more info I would check the documentation page for groupby as well as this blog post.

Edit: To make this scale a bit better you could group on both the day and time as such:

grouped = frame.groupby(lambda x: (x.day, x.hour))
grouped.agg({'radiation': 'sum', 'tamb': 'mean'})
          radiation      tamb
key_0                        
(5, 8)   298.581107  4.883806
(5, 9)   311.176148  4.983705
(5, 10)  315.531527  5.343057
(5, 11)  288.013876  6.022002
(5, 12)    5.527616  8.507670
diliop
  • 9,241
  • 5
  • 28
  • 23