
I have a dataframe -

            starttime   Count
0   2013-10-01 00:00:00 274
1   2013-10-01 01:00:00 140
2   2013-10-01 02:00:00 67
3   2013-10-01 03:00:00 37
4   2013-10-01 04:00:00 57
... ... ...
739 2013-10-31 19:00:00 1690
740 2013-10-31 20:00:00 1207
741 2013-10-31 21:00:00 1011
742 2013-10-31 22:00:00 730
743 2013-10-31 23:00:00 438

In the above dataset, the starttime column has hourly data for each day of a month, with a frequency count for each hour. I am trying to get the 3 largest or smallest counts, with their starttime, on a per-day basis. The resulting dataframe should look like this -

              starttime     Count
0   2013-10-01 00:00:00       274
1   2013-10-01 01:00:00       140
2   2013-10-01 02:00:00        67
3   2013-10-02 06:00:00       300
4   2013-10-02 08:00:00       250
5   2013-10-02 08:00:00       100

I have tried multiple manipulations to achieve this but failed. I am not posting my attempts, as there were many hit-and-trial approaches.

I have tried splitting the date and time separately -

df2_start_timeseries['date'] = pd.to_datetime(df2_start_timeseries['starttime']).dt.date
df2_start_timeseries['time'] = pd.to_datetime(df2_start_timeseries['starttime']).dt.time

and then grouping by date and sorting by the Count column, but that didn't work either.

  • It'd help to provide some complete but minimal example data and complete expected output for it. See [How to make good reproducible pandas examples](/q/20109391/4518341). For more info, see [mre]. – wjandrea May 14 '23 at 18:30

2 Answers


You could group by date and rank the count values (below I have used the 'max' method), then return only the rows where the rank is <= 3:

import pandas as pd
import numpy as np

np.random.seed(4)

times = pd.date_range("2013-10-01", "2013-10-31", freq="H")
c = np.random.randint(0, 500, len(times))

df = pd.DataFrame({"starttime": times,
                   "count": c})

# groupby by date, rank the 'count' column and only return rows where rank is <= 3
df.loc[df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False).le(3), :]

If there might be duplicate counts that you would also want to include, you could use the 'dense' method instead, which assigns ties the same (minimum) rank with no gaps afterwards; see the pandas documentation for Series.rank for more detail.
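For example, a minimal sketch of the 'dense' variant, reusing the df built above:

# with ties, this can return more than 3 rows for a day, but it never
# drops a row holding one of that day's 3 highest counts
dense_mask = df.groupby(df["starttime"].dt.date)["count"].rank("dense", ascending=False).le(3)
df.loc[dense_mask, :]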

Alternatively, you could sort the values then take the first 3 in each date group, as shown below as # option 2.

Why rank().le(3)

The question asks for the minimum 3, or maximum 3 values for each day, so the "count" column needs to be ordered somehow within each group.

# for maximum
# option 1: rank().le(3)
ranked = df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False)
ranked.head()
# Out[]: 
# 0    19.0
# 1    16.0
# 2     3.0
# 3    15.0
# 4     6.0
# Name: count, dtype: float64
ranked = ranked.le(3)  # boolean Series: True where the rank is less than or equal to 3
df.loc[ranked, :]  # returns the filtered dataframe.
# Out[]: 
#               starttime  count
# 2   2013-10-01 02:00:00    439
# 7   2013-10-01 07:00:00    456
# 16  2013-10-01 16:00:00    439
# 24  2013-10-02 00:00:00    490
# 33  2013-10-02 09:00:00    486
# ..                  ...    ...
# 695 2013-10-29 23:00:00    476
# 709 2013-10-30 13:00:00    484
# 711 2013-10-30 15:00:00    444
# 715 2013-10-30 19:00:00    447
# 720 2013-10-31 00:00:00    495

# [89 rows x 2 columns]

# option 2: sort_values().head(3)
# sort the rows by count and start time, then groupby date and take the first 3 values
sorted_ = df.sort_values(["count", "starttime"], ascending=[False, True]).groupby(df["starttime"].dt.date).head(3)
# sort again by starttime so they are back in the correct order.
sorted_.sort_values("starttime")
# Out[]: 
#               starttime  count
# 2   2013-10-01 02:00:00    439
# 7   2013-10-01 07:00:00    456
# 16  2013-10-01 16:00:00    439
# 24  2013-10-02 00:00:00    490
# 33  2013-10-02 09:00:00    486
# ..                  ...    ...
# 695 2013-10-29 23:00:00    476
# 709 2013-10-30 13:00:00    484
# 711 2013-10-30 15:00:00    444
# 715 2013-10-30 19:00:00    447
# 720 2013-10-31 00:00:00    495

# [91 rows x 2 columns]

As you can see above, there is a slight difference in what is returned. rank("max", ascending=False) assigns duplicates the last rank of their tied group, i.e. if the sorted values are [5, 4, 4, 3, 2], the ranks are [1, 3, 3, 4, 5]. Using "min" assigns the first rank of the tied group: [1, 2, 2, 4, 5]; and using rank("dense", ascending=False), it would instead be [1, 2, 2, 3, 4] - both of which might return more than 3 rows of data, but always all of the rows holding the 3 highest counts.
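A quick sketch makes those tie rules concrete (note that Series.rank takes method as a keyword argument, unlike GroupBy.rank above, where it is the first positional argument):

import pandas as pd

s = pd.Series([5, 4, 4, 3, 2])
s.rank(method="max", ascending=False).tolist()    # [1.0, 3.0, 3.0, 4.0, 5.0]
s.rank(method="min", ascending=False).tolist()    # [1.0, 2.0, 2.0, 4.0, 5.0]
s.rank(method="dense", ascending=False).tolist()  # [1.0, 2.0, 2.0, 3.0, 4.0]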

Rawson
  • Would you explain why `df.groupby(df["starttime"].dt.date)["count"]` is not enough on its own here, and what the effect/advantage of `rank().le(3)` is? Would you use `np.random.seed(4)` to demonstrate why we need to *...only return rows where rank is <= 3*, based on your comment in the script? I just want to learn. – Mario May 14 '23 at 21:36
  • Sure. To get the maximum/minimum, you can rank the values in the "count" column - and doing so with `"max", ascending=False` means that the highest value will be ranked as `1`, second highest as `2`, etc. Then using `le(3)`, which is 'less than or equal to 3', only those that are the top 3 highest counts will be returned, as stated in the question. You need to rank the data in order to take the correct rows of the grouped data. Similarly, you could use `.sort_values(ascending=False).head(3)` to sort the "count" column and take the top 3 values - this has the same result. – Rawson May 14 '23 at 21:52
  • I have updated the answer with a `seed(4)`. – Rawson May 14 '23 at 22:11
  • Being a novice programmer, I would like to understand the thought process for option 2; more than the code, the way you thought it out matters a lot. Can you please explain it a bit more? Option 1 felt a tad trickier to me. @Rawson – Ritesh Mukhopadhyay May 15 '23 at 18:03
  • For option 2, my thought process was: the data could be grouped then sorted, and the top 3 taken through `.head()` - that was where I started. Then, I knew (and this could be worked out through trial and error) that the grouped data cannot be sorted, which meant the rows needed to be sorted first and grouped second. For the order of the sort, it is much easier to sort by the pure column rather than a manipulation of the column - so the datetime values should be left as they are. Therefore arranging by count then starttime was simplest. – Rawson May 15 '23 at 18:44
  • As the output of that line was still sorted by count and filtered to the first 3 for each date, the data needed to be sorted again by "starttime" so that it was back in the chronological order it started in. – Rawson May 15 '23 at 18:46

You can group by date and then select the 3 largest/smallest from each group with nlargest/nsmallest:

top_3_large = df.groupby(df["starttime"].dt.date, group_keys=False)["count"].nlargest(3)
top_3_large
7      456
2      439
16     439
24     490
33     486
      ... 
678    471
709    484
715    447
711    444
720    495
Name: count, Length: 91, dtype: int64

(You can change group_keys to True to see the groups clearly when printing)
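For example, with the df built in the first answer's setup:

df.groupby(df["starttime"].dt.date, group_keys=True)["count"].nlargest(3)
# the result now carries a MultiIndex of (date, original row index),
# so each day's 3 largest counts sit under their date label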

Then join the result with the initial df's starttime column to recover the starttime of each instance:

pd.concat([df["starttime"], top_3_large.rename("3_largest")], join="inner", axis=1)
              starttime  3_largest
2   2013-10-01 02:00:00        439
7   2013-10-01 07:00:00        456
16  2013-10-01 16:00:00        439
24  2013-10-02 00:00:00        490
33  2013-10-02 09:00:00        486
..                  ...        ...
695 2013-10-29 23:00:00        476
709 2013-10-30 13:00:00        484
711 2013-10-30 15:00:00        444
715 2013-10-30 19:00:00        447
720 2013-10-31 00:00:00        495

[91 rows x 2 columns]
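The 3 smallest counts per day work the same way; a sketch mirroring the code above:

top_3_small = df.groupby(df["starttime"].dt.date, group_keys=False)["count"].nsmallest(3)
pd.concat([df["starttime"], top_3_small.rename("3_smallest")], join="inner", axis=1)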
CarlosGDCJ