pandas advanced conditional groupby

Question

For each day of the month (1-31), I want to compute the number of trips occurring within any 60 minute period on that day,so like the first entry is 0:00, I want to get the number of trips from 0:00 to 1:00, the second entry is 0:01,I want to get the number of trips from 0:01 to 1:01.

And it must be within the same day, so from 23:00, you do not need to count the number of trips

How to do groupby on this ?

This doesn't sound like a group-by operation to me since you end up with the same number of rows you began with (no aggregation). Seems more like a function you would apply on your frame's rows. — ehudk, Apr 07 '18 at 20:39

hoyland · Answer 1 · 2018-04-07T21:14:49.027

I would do this in a few steps:

group by pickup time and count how many trips occur in each minute (I'm assuming you're doing this based on pickup time), call this dataframe by_minute
cross join by_minute with itself (so you have the cartesian product) and filter to the rows where the two pickup times satisfy your condition
group by (the "left hand") pickup time and sum the counts of the "right hand" trips

I believe you'll have to add a dummy column to by_minute to facilitate the join. Something like:

by_minute['key'] = 1
joined = by_minute.merge(by_minute, on='key')

There are some other options mentioned this question. merge has a suffixes argument (by default it's ('_x', '_y') that will distinguish the column names.

ehudk · Answer 2 · 2018-04-07T21:58:38.443

As I wrote in the comment, as far as I can tell, you're not looking for a groupby, but rather some operation on each row.
I came up with the following solution using apply:

s = pd.to_datetime(df["pickup_datetime"])    # make sure the pickup column is datetime
r = s.apply(lambda x: np.sum(s.between(x, x + pd.Timedelta("1hr")) & (s.dt.dayofyear == x.dayofyear)))

Lets break it down:
This will go over each row (apply) and create a Boolean mask based on two conditions:

All the pickup times that fall within an hour from the current pickup time.
All the pickup times that are within the same date (day of the year) as the current pickup time.

We then combine them with an AND operation (&).
This have created a boolean array the size of your Series with True wherever both conditions are met.
Finally, we sum (np for Numpy) this boolean array which is equivalent to counting the number of entries that met the condition above.

pandas advanced conditional groupby

2 Answers2