7

I have the following dataframe structure that is indexed with a timestamp:

                   neg    neu    norm       pol    pos  date
    time
    1520353341   0.000  1.000  0.0000  0.000000  0.000
    1520353342   0.121  0.879 -0.2960  0.347851  0.000
    1520353342   0.217  0.783 -0.6124  0.465833  0.000

I create a date from the timestamp:

data_frame['date'] = [datetime.datetime.fromtimestamp(d) for d in data_frame.time]

Result:

                   neg    neu    norm       pol    pos                 date
    time
    1520353341   0.000  1.000  0.0000  0.000000  0.000  2018-03-06 10:22:21
    1520353342   0.121  0.879 -0.2960  0.347851  0.000  2018-03-06 10:22:22
    1520353342   0.217  0.783 -0.6124  0.465833  0.000  2018-03-06 10:22:22

I want to group by hour, taking the mean of all the values, except the timestamp, which should be the timestamp of the hour where the group started. This is the result I want to achieve:

                     neg       neu      norm       pol       pos
    time
    1520352000  0.027989  0.893233  0.122535  0.221079  0.078779
    1520355600  0.028861  0.899321  0.103698  0.209353  0.071811

The closest I have gotten so far has been with this answer:

data = data.groupby(data.date.dt.hour).mean()

Results:

               neg       neu      norm       pol       pos
    date
    0     0.027989  0.893233  0.122535  0.221079  0.078779
    1     0.028861  0.899321  0.103698  0.209353  0.071811

But I can't figure out how to keep a timestamp that reflects the hour where the groupby started.

Franco

3 Answers

21

I came across this gem, pd.DataFrame.resample, after I posted my round-to-hour solution.

# Construct example dataframe
times = pd.date_range('1/1/2018', periods=5, freq='25min')
values = [4,8,3,4,1]
df = pd.DataFrame({'val':values}, index=times)

# Resample by hour and calculate medians
df.resample('H').median()

Or you can use groupby with Grouper if you don't want times as index:

df = pd.DataFrame({'val':values, 'times':times})
df.groupby(pd.Grouper(key='times', freq='H')).median()
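If you also want the index back as Unix timestamps, as in the desired output in the question, you can convert the resampled DatetimeIndex to epoch seconds. A minimal sketch (the `hourly` name is mine):

```python
import pandas as pd

# Same example data as above
times = pd.date_range('1/1/2018', periods=5, freq='25min')
df = pd.DataFrame({'val': [4, 8, 3, 4, 1]}, index=times)

# Resample by hour, then turn the hour starts back into epoch seconds
hourly = df.resample('H').median()
hourly.index = hourly.index.astype('int64') // 10**9
print(hourly)
```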
Jordi
  • Very neat answer – smerllo Dec 07 '19 at 23:01
  • These methods don't work if the data frame spans multiple days i.e. it does not ignore the date part of a datetime index. The original approach from the question `data = data.groupby(data.date.dt.hour).mean()` does that, but does indeed not preserve the hour. To preserve the hour in such a case you can pull the hour from the datetime index into a separate col before grouping: `data['hour']=data.index.hour` – squarespiral Dec 08 '22 at 17:51
4

Did you try creating an hour column by:

data_frame['hour'] = data_frame.date.dt.hour

Then grouping by hour like:

data = data.groupby(data.hour).mean()
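If the data spans more than one day, grouping by the bare hour number will merge the same hour of different days. A variant of this idea (my sketch, with invented values) floors the datetime to the hour and groups on that, so the group key is the full hour start:

```python
import pandas as pd

df = pd.DataFrame({'time': [1520353341, 1520353342, 1520356941],
                   'pos': [0.000, 0.121, 0.217]})
df['date'] = pd.to_datetime(df['time'], unit='s')

# dt.floor('H') keeps the date part, unlike dt.hour
out = df.groupby(df['date'].dt.floor('H')).mean(numeric_only=True)
print(out)
```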
Connor John
  • Yes, that gives me the same result I have right now. The problem is keeping/generating the timestamp for the beginning of the hour. – Franco Mar 07 '18 at 19:11
2

You can round the timestamp column down to the nearest hour:

import math
df.time = [math.floor(t/3600) * 3600 for t in df.time]

Or even simpler, using integer division:

df.time = [(t//3600) * 3600 for t in df.time]

You can group by this column and thus preserve the timestamp.
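Putting it together as one runnable sketch (invented values; `time` is assumed to be an ordinary column rather than the index), the group keys come out as the hour starts shown in the question:

```python
import pandas as pd

df = pd.DataFrame({'time': [1520353341, 1520353342, 1520356941],
                   'pos': [0.000, 0.121, 0.217]})

# Vectorized floor-to-hour; equivalent to the list comprehension above
df['time'] = (df['time'] // 3600) * 3600

result = df.groupby('time').mean()
print(result)
```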

Jordi
  • How did I not think of this? This works perfectly, such a simple and elegant solution. Thanks! – Franco Mar 08 '18 at 01:53
  • If you already have a datetime index you can floor to the hour like this: `df.loc[df.index.floor('H')]` - also works with other time frequencies 'D', ... – squarespiral Dec 08 '22 at 17:54