Selecting top N values within a group in a column using R

Question

I need to select the top two values of count within each yearmonth group from the following data frame in R. I have already sorted the data by count and yearmonth. How can I achieve that with the following data?

    yearmonth  name             count
1   201310     Dovas            5
2   201310     Indulgd          2
3   201310     Justina          1
4   201310     Jolita           1
5   201311     Shahrukh Sheikh  1
6   201311     Dovas            29
7   201311     Justina          13
8   201311     Lina             8
9   201312     sUPERED          7
10  201312     John Hansen      7
11  201312     Lina D.          6
12  201312     joanna1st        5

Does this answer your question? [Select the top N values by group](https://stackoverflow.com/questions/14800161/select-the-top-n-values-by-group) — camille, Sep 21 '21 at 02:14

akrun · Accepted Answer · 2014-10-30T06:58:57.350

7

Or use data.table (with mydf from @jazzurro's post). Some options are:

  library(data.table)
  setDT(mydf)[order(yearmonth,-count), .SD[1:2], by=yearmonth]

Or

   setDT(mydf)[mydf[order(yearmonth, -count), .I[1:2], by=yearmonth]$V1,]

Or

   setorder(setkey(setDT(mydf), yearmonth), yearmonth, -count)[
                                          ,.SD[1:2], by=yearmonth]
  #    yearmonth        name count
  #1:    201310       Dovas     5
  #2:    201310     Indulgd     2
  #3:    201311       Dovas    29
  #4:    201311     Justina    13
  #5:    201312     sUPERED     7
  #6:    201312 John Hansen     7

edited Oct 30 '14 at 06:58

answered Oct 30 '14 at 06:52


@jazzurro I guess the `setorder` would be a bit faster. – akrun Oct 30 '14 at 07:03
Got it. Thank you very much. I will write down all of them and your comment. Thank you for your generous support. – jazzurro Oct 30 '14 at 07:07
Seems both methods are not taking ties into account. Understand it's not mentioned in the question. – KFB Oct 30 '14 at 07:12
@KFB I was only working with the example data. Perhaps, it can be done with using `rank` and specifying `ties.method` – akrun Oct 30 '14 at 07:16
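
Following up on the ties point, here is a minimal sketch of the `rank` idea mentioned in the comment above, assuming every row tied at the cutoff should be kept. It uses n = 3 rather than 2 because, in the sample data, the only tie that straddles a cutoff is the third place in 201310. The names `dt` and `top3` are illustrative only.

```r
library(data.table)

mydf <- data.frame(
  yearmonth = rep(c("201310", "201311", "201312"), each = 4),
  name = c("Dovas", "Indulgd", "Justina", "Jolita", "Shahrukh Sheikh",
           "Dovas", "Justina", "Lina", "sUPERED", "John Hansen",
           "Lina D.", "joanna1st"),
  count = c(5, 2, 1, 1, 1, 29, 13, 8, 7, 7, 6, 5),
  stringsAsFactors = FALSE
)

# Work on a copy so mydf itself stays a plain data.frame
dt <- as.data.table(mydf)

# rank() on the negated counts gives rank 1 to the largest count.
# ties.method = "min" gives tied rows the same (lowest) rank,
# so all rows tied at the cutoff are kept.
top3 <- dt[, .SD[rank(-count, ties.method = "min") <= 3], by = yearmonth]
top3
```

With this data, 201310 returns four rows instead of three, because Justina and Jolita are tied on a count of 1. Rows come back in their original order within each group, not sorted by count.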

jazzurro · Answer 2 · 2014-10-30T05:09:32.733

Here is one way:

library(dplyr)

mydf %>%
    group_by(yearmonth) %>%
    arrange(desc(count)) %>%
    slice(1:2)

#  yearmonth        name count
#1    201310       Dovas     5
#2    201310     Indulgd     2
#3    201311       Dovas    29
#4    201311     Justina    13
#5    201312     sUPERED     7
#6    201312 John Hansen     7

DATA

mydf <- data.frame(yearmonth = rep(c("201310", "201311", "201312"), each = 4),
                   name = c("Dovas", "Indulgd", "Justina", "Jolita", "Shahrukh Sheikh",
                         "Dovas", "Justina", "Lina", "sUPERED", "John Hansen",
                         "Lina D.", "joanna1st"),
                   count = c(5,2,1,1,1,29,13,8,7,7,6,5),
                   stringsAsFactors = FALSE)
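
As a possible alternative to the `arrange()` plus `slice()` pipeline above, here is a sketch using `slice_max()`, assuming dplyr 1.0.0 or later is installed. `with_ties = FALSE` is set on the assumption that exactly two rows per group are wanted; the default `with_ties = TRUE` would keep extra rows tied at the cutoff. The name `top2` is illustrative only.

```r
library(dplyr)

mydf <- data.frame(
  yearmonth = rep(c("201310", "201311", "201312"), each = 4),
  name = c("Dovas", "Indulgd", "Justina", "Jolita", "Shahrukh Sheikh",
           "Dovas", "Justina", "Lina", "sUPERED", "John Hansen",
           "Lina D.", "joanna1st"),
  count = c(5, 2, 1, 1, 1, 29, 13, 8, 7, 7, 6, 5),
  stringsAsFactors = FALSE
)

top2 <- mydf %>%
  group_by(yearmonth) %>%
  # keep the two rows with the largest count in each group
  slice_max(count, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```

This does the ordering and the selection in one step, so no separate `arrange()` call is needed.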

How do I select the overall top 10? – Curious G. Jun 24 '19 at 01:31
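
One way to read the comment above is "the ten largest counts regardless of group". Under that assumption, a base R sketch is to sort the whole data frame by count and take the first ten rows. The name `top10` is illustrative only.

```r
mydf <- data.frame(
  yearmonth = rep(c("201310", "201311", "201312"), each = 4),
  name = c("Dovas", "Indulgd", "Justina", "Jolita", "Shahrukh Sheikh",
           "Dovas", "Justina", "Lina", "sUPERED", "John Hansen",
           "Lina D.", "joanna1st"),
  count = c(5, 2, 1, 1, 1, 29, 13, 8, 7, 7, 6, 5),
  stringsAsFactors = FALSE
)

# Sort all rows by count, largest first, ignoring yearmonth,
# then keep the first ten rows
top10 <- head(mydf[order(mydf$count, decreasing = TRUE), ], 10)
top10
```

Note that `head()` cuts at exactly ten rows, so if several rows are tied at the cutoff only some of them are kept.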

score 2 · Answer 3 · answered Oct 30 '14 at 08:03


Using base R you could do something like:

# sort by yearmonth and count, both in decreasing order; skip if already sorted
# (df is the data frame from the question)
df <- df[order(df$yearmonth, df$count, decreasing = TRUE),]

Then, to get the top two elements:

df[ave(df$count, df$yearmonth, FUN = seq_along) <= 2, ]
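
For reference, here is a sketch of the two steps above run end to end on the sample data from the other answer. It assumes the groups should come out in ascending yearmonth order, so it sorts with `-count` instead of `decreasing = TRUE`. The names `sorted` and `top2` are illustrative only.

```r
mydf <- data.frame(
  yearmonth = rep(c("201310", "201311", "201312"), each = 4),
  name = c("Dovas", "Indulgd", "Justina", "Jolita", "Shahrukh Sheikh",
           "Dovas", "Justina", "Lina", "sUPERED", "John Hansen",
           "Lina D.", "joanna1st"),
  count = c(5, 2, 1, 1, 1, 29, 13, 8, 7, 7, 6, 5),
  stringsAsFactors = FALSE
)

# Step 1: sort by yearmonth ascending, then count descending
sorted <- mydf[order(mydf$yearmonth, -mydf$count), ]

# Step 2: ave() applies seq_along() within each yearmonth group,
# numbering the rows 1, 2, 3, ... so the first two can be kept
top2 <- sorted[ave(sorted$count, sorted$yearmonth, FUN = seq_along) <= 2, ]
top2
```

The row numbering depends on the data being sorted first, so step 1 cannot be skipped unless the data is already ordered by count within each group.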

talat

