Selecting top N rows for each group in dataframe

Question

I have an example data frame:

Index        Country
4.1             USA
2.1             USA
5.2             USA
1.1             Singapore
6.2             Singapore
8.1             Germany
4.5             Italy
7.1             Italy
2.3             Italy
5.9             Italy
8.8             Russia

I want to get the first N rows for each Country group in the data frame. For example, if N = 3, I take 3 rows from each group. If a group has fewer than N rows, like Singapore, all of its rows are kept (here, the two Singapore records). If a group has more than N rows, like Italy, only three of them are taken.

For N = 3, the output dataframe would be:

Index         Country
4.1             USA
2.1             USA
5.2             USA
1.1             Singapore
6.2             Singapore
8.1             Germany
4.5             Italy
7.1             Italy
2.3             Italy
8.8             Russia

I was thinking of something like:

aggregate(df, by=list(df$Country), head(df, 3))

But it doesn't seem to work.

Does this answer your question? [Select the top N values by group](https://stackoverflow.com/questions/14800161/select-the-top-n-values-by-group) — Matt, Feb 06 '20 at 13:58
The "correct" syntax for aggregate would be `aggregate(df, by = list(df$Country), FUN = head, 3)`, but `aggregate` wants to return 1 row per group, so it adds the additional rows as extra columns, so it's not great. — Gregor Thomas, Feb 06 '20 at 14:02

Georgery · Accepted Answer · 2020-02-06T14:03:22.803

Using the dplyr package from the tidyverse, you can do this:

library(tidyverse)

df <- tribble(
    ~Index, ~Country
    , 4.1, "USA"
    , 2.1, "USA"
    , 5.2, "USA"
    , 1.1, "Singapore"
    , 6.2, "Singapore"
    , 8.1, "Germany"
    , 4.5, "Italy"
    , 7.1, "Italy"
    , 2.3, "Italy"
    , 5.9, "Italy"
    , 8.8, "Russia"
)

df %>% # take the dataframe
    group_by(Country) %>% # group it by the grouping variable
    slice(1:3) # and pick rows 1 to 3 per group

Output (note that the rows come back ordered by group, not in their original order):

   Index Country  
   <dbl> <chr>    
 1   8.1 Germany  
 2   4.5 Italy    
 3   7.1 Italy    
 4   2.3 Italy    
 5   8.8 Russia   
 6   1.1 Singapore
 7   6.2 Singapore
 8   4.1 USA      
 9   2.1 USA      
10   5.2 USA
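
If you would rather keep the rows in their original order, as in the expected output in the question, a base R approach may also work. The following is only a sketch, not part of the dplyr solution above: it rebuilds the example data as a plain data.frame, and the names `n`, `pos_in_group` and `top_n_rows` are made up for illustration.

```r
# Same example data as above, built as a plain data.frame
df <- data.frame(
  Index = c(4.1, 2.1, 5.2, 1.1, 6.2, 8.1, 4.5, 7.1, 2.3, 5.9, 8.8),
  Country = c("USA", "USA", "USA", "Singapore", "Singapore", "Germany",
              "Italy", "Italy", "Italy", "Italy", "Russia")
)

n <- 3  # number of rows to keep per group (illustrative name)

# ave() applies seq_along() within each Country, so every row gets its
# position inside its own group (1, 2, 3, ...)
pos_in_group <- ave(seq_len(nrow(df)), df$Country, FUN = seq_along)

# Keep only rows whose within-group position is at most n;
# logical indexing leaves the original row order untouched
top_n_rows <- df[pos_in_group <= n, ]
top_n_rows
```

Groups with fewer than `n` rows (Singapore, Germany, Russia) are kept whole, and only the fourth Italy row is dropped.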
