You could groupby date, and rank the count values (below I have used 'max'). Then return only the rows where the rank is '<=3':
import pandas as pd
import numpy as np
np.random.seed(4)
times = pd.date_range("2013-10-01", "2013-10-31", freq="H")
c = np.random.randint(0, 500, len(times))
df = pd.DataFrame({"starttime": times,
"count": c})
# groupby by date, rank the 'count' column and only return rows where rank is <= 3
df.loc[df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False).le(3), :]
If there might be duplicates that you would also want to use, you could use 'dense' which will take the minimum values. See more here.
Alternatively, you could sort the values then take the first 3 in each date group, as shown below as # option 2
.
Why rank().le(3)
The question asks for the minimum 3, or maximum 3 values for each day, so the "count" column needs to be ordered somehow within each group.
# for maximum
# option 1: rank().le(3)
ranked = df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False)
ranked.head()
# Out[]:
# 0 19.0
# 1 16.0
# 2 3.0
# 3 15.0
# 4 6.0
# Name: count, dtype: float64
ranked = ranked.le(3) # returned boolean of True/False if less than or equal to 3
df.loc[ranked, :] # returns the filtered dataframe.
# Out[]:
# starttime count
# 2 2013-10-01 02:00:00 439
# 7 2013-10-01 07:00:00 456
# 16 2013-10-01 16:00:00 439
# 24 2013-10-02 00:00:00 490
# 33 2013-10-02 09:00:00 486
# .. ... ...
# 695 2013-10-29 23:00:00 476
# 709 2013-10-30 13:00:00 484
# 711 2013-10-30 15:00:00 444
# 715 2013-10-30 19:00:00 447
# 720 2013-10-31 00:00:00 495
# [89 rows x 2 columns]
# option 2: sort_values().head(3)
# sort the rows by count and start time, then groupby date and take the first 3 values
sorted_ = df.sort_values(["count", "starttime"], ascending=[False, True]).groupby(df["starttime"].dt.date).head(3)
# sort again by starttime so they are back in the correct order.
sorted_.sort_values("starttime")
# Out[]:
# starttime count
# 2 2013-10-01 02:00:00 439
# 7 2013-10-01 07:00:00 456
# 16 2013-10-01 16:00:00 439
# 24 2013-10-02 00:00:00 490
# 33 2013-10-02 09:00:00 486
# .. ... ...
# 695 2013-10-29 23:00:00 476
# 709 2013-10-30 13:00:00 484
# 711 2013-10-30 15:00:00 444
# 715 2013-10-30 19:00:00 447
# 720 2013-10-31 00:00:00 495
# [91 rows x 2 columns]
As you can see in the above, there is a slight difference in what is returned. The above rank("max", ascending=False)
will set duplicates to be the last of the duplicate rank, i.e. if sorted as [5, 4, 4, 3, 2]
, the rank would be [1, 3, 3, 4, 5]
. Using "min"
will be first of duplicate rank: [1, 2, 2, 4, 5]
, and using rank("dense", ascending=False)
, then it would instead be [1, 2, 2, 3, 4]
- both of which might return more than 3 rows of data, but all of the rows with the 3 highest counts.