Selecting top N rows for each group based on value in column

Question

I have a data frame like the one below:

x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)
df<-data.frame(x,y,z)

df
    x y z
1   3 a 2
2   2 a 2
3   1 a 2
4   8 b 1
5   7 b 1
6  11 c 3
7  10 c 3
8   9 c 3
9   7 c 3
10  5 c 3
11  4 c 3

I want to select the top n rows for each group in column y, where n is given in column z. So the output should look like:

Would `z` values be always same for a group? What if they are different? How to select `n`? — Ronak Shah, Jul 10 '17 at 08:08
z values are always the same for a group. So the n value for group "a" is 2, for "b" is 1 and for "c" is 3. — vsb, Jul 10 '17 at 08:10
`library(dplyr); df %>% group_by(y) %>% slice(1:z[1])` should work. — HNSKD, Jul 10 '17 at 08:13
@HNSKD your code works, but what if two values in column x are the same within a group? — vsb, Jul 10 '17 at 08:23
Thanks @Prem and @HNSKD, `df %>% group_by(y) %>% unique() %>% slice(1:z[1])` is what I was looking for. — vsb, Jul 10 '17 at 08:49
@vsb You've received a few good answers below. If one of them worked for you, please consider accepting it by clicking on the check mark to the left of the answer. This lets the community know the answer solved your issue and that the issue should be closed. — CPak, Sep 09 '17 at 02:03

Cath · Accepted Answer · 2017-07-10T08:40:39.617

3

A solution with base R:

# df is split according to y, then we keep only the top "z" value (after ordering x) 
# and rbind everything back together:
do.call(rbind, 
        lapply(split(df, df$y), 
               function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
#     x y z
#a.1  3 a 2
#a.2  2 a 2
#b    8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8  9 c 3

EDIT:
A much more direct way (still in base R), provided in a comment by @mt1022. Note that it assumes the rows are already sorted by x in descending order within each group:

df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
#   x y z
#1  3 a 2
#2  2 a 2
#4  8 b 1
#6 11 c 3
#7 10 c 3
#8  9 c 3

edited Jul 10 '17 at 08:40

answered Jul 10 '17 at 08:22

Cath


2

Another base R solution: `df[ave(1:nrow(df), df$y, FUN = seq_along) <= as.numeric(df$z), ]` – mt1022 Jul 10 '17 at 08:31
@mt1022 I should get more familiar with `ave` ;-), this is much better base R solution, you should post that – Cath Jul 10 '17 at 08:33
It is adapted from this one: https://stackoverflow.com/questions/12925063/numbering-rows-within-groups-in-a-data-frame. I just add a filter step. Maybe you can add it to your answer as an alternative way. – mt1022 Jul 10 '17 at 08:35
@mt1022 thanks, I'll add it but feel free to change your mind and post and I'll delete the edit ;-) – Cath Jul 10 '17 at 08:38
@mt1022 but this is assuming the data is _descendingly_ sorted. – Ronak Shah Jul 10 '17 at 08:46
1

@RonakShah, sure. If the data are not already in right order, an additional `order(-as.numeric(df$x))` is required. – mt1022 Jul 10 '17 at 08:49
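To illustrate the point about ordering raised in the comments above, here is a minimal base R sketch that sorts first and then applies the `ave` filter. The shuffled `x` values and the names `sorted` and `top` are assumptions made up for this example, not part of the original question.

```r
# Same groups and z values as in the question, but x is deliberately unsorted
df <- data.frame(
  x = c(1, 3, 2, 7, 8, 4, 11, 5, 10, 9, 7),
  y = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c"),
  z = c(2, 2, 2, 1, 1, 3, 3, 3, 3, 3, 3)
)

# Sort by group, then by x in descending order within each group
sorted <- df[order(df$y, -df$x), ]

# Number the rows within each group and keep those whose rank is <= z
top <- sorted[ave(seq_len(nrow(sorted)), sorted$y, FUN = seq_along) <= sorted$z, ]
top
```

Sorting on `y` as well as `-x` keeps each group contiguous, so the result comes back grouped as a, b, c.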

score 1 · Answer 2 · edited Jul 10 '17 at 08:22


One approach with data.table:

library(data.table)
setDT(df)
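# flag the first z rows of each group, keep the flagged rows, then drop the flag column
df[, .(inc = seq_len(.N) <= z, x, z), by = .(y)][inc == TRUE, -2]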
#   y  x z
#1: a  3 2
#2: a  2 2
#3: b  8 1
#4: c 11 3
#5: c 10 3
#6: c  9 3

edited Jul 10 '17 at 08:22

Cath


answered Jul 10 '17 at 08:15

Erdem Akkas


4

Or just `df[, .SD[seq_len(.N) <= z], by = y]` – David Arenburg Jul 10 '17 at 08:30
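A sketch of how the shorter `.SD` form might be combined with an explicit sort, in case the rows are not already ordered by x. The shuffled `x` values and the names `dt` and `top` are assumptions for illustration only.

```r
library(data.table)

# Same groups and z values as in the question, but x is deliberately unsorted
df <- data.frame(
  x = c(1, 3, 2, 7, 8, 4, 11, 5, 10, 9, 7),
  y = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c"),
  z = c(2, 2, 2, 1, 1, 3, 3, 3, 3, 3, 3)
)

# as.data.table() makes a copy, so the original data frame is left untouched
dt <- as.data.table(df)

# Order by group and descending x, then keep the first z rows of each group
top <- dt[order(y, -x), .SD[seq_len(.N) <= z], by = y]
top
```

Using `as.data.table()` rather than `setDT()` is a choice made here to avoid modifying `df` by reference; either should work for the filtering step.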

score 0 · Answer 3 · answered Jul 10 '17 at 08:24


A solution with dplyr that uses do:

library(dplyr)

# take the first z rows of each group (assumes rows are already sorted by x)
df %>%
   group_by(y) %>%
   do(head(., as.numeric(unique(.$z))))

answered Jul 10 '17 at 08:24

CPak


score 0 · Answer 4 · answered Aug 24 '18 at 23:51

I'm posting the solution I was looking for using dplyr. It is based on @HNSKD's comment:

library(dplyr)
x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)

df<-data.frame(x,y,z)

df %>% group_by(y) %>% slice(1:2)

This returns the first two rows for each y (a fixed n of 2, regardless of the value in z):

# A tibble: 6 x 3
# Groups:   y [3]
      x y         z
  <dbl> <fct> <dbl>
1     3 a         2
2     2 a         2
3     8 b         1
4     7 b         1
5    11 c         3
6    10 c         3
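For the original question, where n varies per group and comes from column z, a dplyr sketch along the same lines could filter on the within-group row number instead of slicing a fixed range. The shuffled `x` values and the name `top` are assumptions for illustration.

```r
library(dplyr)

# Same groups and z values as in the question, but x is deliberately unsorted
df <- data.frame(
  x = c(1, 3, 2, 7, 8, 4, 11, 5, 10, 9, 7),
  y = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c"),
  z = c(2, 2, 2, 1, 1, 3, 3, 3, 3, 3, 3)
)

top <- df %>%
  group_by(y) %>%
  arrange(desc(x), .by_group = TRUE) %>%  # largest x first within each group
  filter(row_number() <= z) %>%           # keep the first z rows of each group
  ungroup()
top
```

Here group "b" keeps only one row, because its z value is 1.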
