I have a Pandas dataframe with two columns, speed and date.

speed   date
54.72   1:33:56
49.37   1:33:59
37.03   1:34:03
24.02   7:39:58
28.02   7:40:01
24.04   7:40:04
24.02   7:40:07
25.35   7:40:10
26.69   7:40:13
32.04   7:40:16
28.02   11:05:43
30.71   11:05:46
29.36   11:05:49
18.68   11:05:52
54.72   11:05:55
34.69   10:31:34
25.03   10:31:38
56.04   10:31:40
44.03   10:31:43

I want to calculate the average speed per 30-minute bin. For example, the average speed during the 4th bin (1:30 - 2:00) is (54.72 + 49.37 + 37.03)/3. I have thought of converting hours, minutes and seconds to seconds since 00:00 and then using bins of 1800 seconds. I have tried `binned_statistic` from `scipy.stats`, but my main issue is that I cannot find a way to separate the bins based on the date column and get the average speed per bin.

Any ideas?
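To show what I mean, here is roughly the seconds-since-00:00 idea sketched with `binned_statistic` (only the first rows of the table, and bin edges assumed to start at 00:00):

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic

# A few rows reconstructed from the table above
speeds = [54.72, 49.37, 37.03, 24.02, 28.02]
times = ["1:33:56", "1:33:59", "1:34:03", "7:39:58", "7:40:01"]

# Convert H:M:S strings to seconds since 00:00
seconds = pd.to_timedelta(times).total_seconds()

# Fixed 30-minute (1800 s) edges covering the whole day
edges = np.arange(0, 24 * 3600 + 1, 1800)
means, _, bin_idx = binned_statistic(seconds, speeds,
                                     statistic="mean", bins=edges)

# means[3] is the 1:30-2:00 bin: (54.72 + 49.37 + 37.03) / 3
print(means[3])
```

This gives one mean per fixed bin (NaN for empty bins), but I still have to map bin indices back to readable time ranges by hand.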

manosbar
  • What have you tried so far? – norok2 Sep 20 '18 at 14:04
  • @norok2 I have tried to do use `binned_statistic` from `scipy.stats` but I cannot find a way to separate bins based on date and get the average of speeds. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html – manosbar Sep 20 '18 at 14:07
  • Possible duplicate of [Time difference within group by objects in Python Pandas](https://stackoverflow.com/questions/41929772/time-difference-within-group-by-objects-in-python-pandas) – Nihal Sep 20 '18 at 14:07
  • @Nihal my main issue is what I replied above and have now added in the description of the question. Thank you for your remark in either case. – manosbar Sep 20 '18 at 14:11
  • 1
    This question is useful, specially given the answer by hellpander, but you should mention in the title that it has to do with binning according to a "date" column. Otherwise nobody will find it. – HerrIvan Sep 20 '18 at 14:20

2 Answers

Converting to datetime and using pandas.Grouper + Offset Aliases:

import pandas as pd

df['date'] = pd.to_datetime(df.date)
df.groupby(pd.Grouper(key='date', freq='30min')).mean().dropna()

                         speed
date
2018-09-20 01:30:00  47.040000
2018-09-20 07:30:00  26.311429
2018-09-20 10:30:00  39.947500
2018-09-20 11:00:00  32.298000
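For reference, a self-contained version of this answer using the full sample data from the question, which reproduces the output above (apart from whatever today's date is):

```python
import pandas as pd

# Full sample data from the question
df = pd.DataFrame({
    "speed": [54.72, 49.37, 37.03, 24.02, 28.02, 24.04, 24.02, 25.35,
              26.69, 32.04, 28.02, 30.71, 29.36, 18.68, 54.72, 34.69,
              25.03, 56.04, 44.03],
    "date": ["1:33:56", "1:33:59", "1:34:03", "7:39:58", "7:40:01",
             "7:40:04", "7:40:07", "7:40:10", "7:40:13", "7:40:16",
             "11:05:43", "11:05:46", "11:05:49", "11:05:52", "11:05:55",
             "10:31:34", "10:31:38", "10:31:40", "10:31:43"],
})

# to_datetime attaches today's date to each time-of-day string
df["date"] = pd.to_datetime(df["date"])

# Group into fixed 30-minute bins; dropna() removes the empty bins
out = df.groupby(pd.Grouper(key="date", freq="30min")).mean().dropna()
print(out)
```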
hellpanderr
  • 1
    A good solution. The only thing I'd be a bit cautious of, is that you're using `datetime` to represent a variable that doesn't actually have a `date` associated with it. While `pd.Grouper` gives you the bins you want with a `datetime`, it creates totally different bins (edges from the first `timedelta` in your `Series`), when using a `timedelta`, so it can be somewhat less predictable. – ALollz Sep 20 '18 at 14:27
  • 2
    Yes, your solution is cleaner. – hellpanderr Sep 20 '18 at 14:28
  • 1
    I do think this is great, and the voting clearly agrees. I just find that `pandas` built in date binning methods tend to be inconsistent with whether the bins start from some "sensible" edges (i.e from 00:00:00) or from the earliest measurement in your `Series`. That seems to happen here, simply changing from `datetime` to `timedelta`, which is a shame. – ALollz Sep 20 '18 at 14:31
  • I believe this is the best solution. I also kept `pd.to_timedelta` from @ALollz 's answer below, as I didn't want today's date in my output. – manosbar Sep 20 '18 at 14:49
Since your date column isn't really a date, it's probably more sensible to convert it to a timedelta; that way you don't have a date attached to it.

Then, you can use dt.floor to group into 30 minute bins.

import pandas as pd

df['date'] = pd.to_timedelta(df.date)
df.groupby(df.date.dt.floor('30min')).mean()

Output:

              speed
date               
01:30:00  47.040000
07:30:00  26.311429
10:30:00  39.947500
11:00:00  32.298000
ALollz