6

Starting with something like this:

from pandas import DataFrame
time = np.array(('2015-08-01T00:00:00','2015-08-01T12:00:00'),dtype='datetime64[ns]')
heat_index = np.array([101,103])
air_temperature = np.array([96,95])

df = DataFrame({'heat_index':heat_index,'air_temperature':air_temperature},index=time)

yielding this for df:

                     air_temperature    heat_index
2015-08-01 07:00:00  96                 101
2015-08-01 19:00:00  95                 103

then resample daily:

df_daily = df.resample('24H',how='max')

To get this for df_daily:

            air_temperature     heat_index
2015-08-01  96                  103

So by resampling using how='max' pandas resamples each 24 hour period, taking the maximum value within that period from each column.

But as you can see looking at df output for 2015-08-01, that day's maximum heat index (which occurs at 19:00:00) does not correlate with air temperature occurred at the same time. That is, the heat index of 103F was caused with an air temperature of 95F. This association is lost through resampling, and we end up looking at the air temperature from a different part of the day.

Is there a way to resample just one column, and preserve the value in another column at the same index? So that the final outcome would look like this:

            air_temperature     heat_index
2015-08-01  95                  103

My first guess is to just resample the heat_index column...

df_daily = df.resample('24H',how={'heat_index':'max'})

to get...

            air_temperature
2015-08-01  103

...and then trying to do some sort of DataFrame.loc or DataFrame.ix from there, but have been unsuccessful. Any thoughts on how to find the related value after resampling (e.g. to find the air_temperature that occurred at the same time as what is later found to be the maximum heat_index)?

csg2136
  • 235
  • 4
  • 10

1 Answers1

2

Here's one way - the .groupby(TimeGrouper()) is essentially what resample is doing, then the aggregation function filters each group to the max observation.

In [60]: (df.groupby(pd.TimeGrouper('24H'))
            .agg(lambda df: df.loc[df['heat_index'].idxmax(), :]))

Out[60]: 
            air_temperature  heat_index
2015-08-01               95         103
chrisb
  • 49,833
  • 8
  • 70
  • 70
  • This solution was quite helpful for my problem, too -- except that it doesn't seem to be very robust. If you try the following little program with the data file at http://redmine.iek.fz-juelich.de/attachments/download/483/test.csv it fails with a KeyError. With other datasets it works fine. Can someone tell me what to do to make it work also for the test data? – maschu Jun 22 '16 at 12:43
  • FILE="test.csv" sampling = 'AS' import pandas as pd data = pd.read_csv(FILE, index_col=0, parse_dates=True) grouped = data.groupby(pd.TimeGrouper(sampling)) dfres = grouped.agg(lambda df: df.loc[df['values'].idxmax(), :]) – maschu Jun 22 '16 at 12:44
  • OK - found the solution. I need to fillna() missing values beforehand. `data.loc[:,'values'].fillna(-999., inplace=True)`does the trick. – maschu Jun 22 '16 at 14:38