I have a data frame with three columns: a, b, and c. There are multiple rows for each unique value of column a, and I want to select the top 5 rows for each of those values. Column c holds some value, and the data frame is already sorted by it in descending order, so ordering is not a problem. Can anyone suggest how to do this in R?

Kunal Batra
  • 2
    a combination of `plyr::ddply` and `head` or `data.table`. There are plenty of examples on SO – mnel Aug 30 '12 at 07:00
  • 1
    and if, after you look for SO posts that are similar to your question, you still can't find a helpful answer, a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would help a lot – BenBarnes Aug 30 '12 at 07:16

2 Answers

4

Stealing @ptocquin's example, here's how you can use the base function `by`. You can flatten the result using `do.call` (see below).

> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
   a          b         c
21 1  0.1188552 1.6389895
41 1  1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1  0.6485745 0.7734652
31 1  1.5102255 0.7107957
------------------------------------------------------------ 
data$a: 2
   a           b          c
15 2 -1.09704040  1.1710693
85 2  0.42914795  0.8826820
65 2 -1.01480957  0.6736782
45 2 -0.07982711  0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------ 
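
The output of `by` can be flattened directly as well. The following is a minimal sketch, not part of the original answer: the sample data (4 groups of 10 rows, fixed seed) and the names `top5.by` and `top5` are assumptions made for illustration.

```r
# Hypothetical sample data: 4 groups of 10 rows each,
# sorted by c in descending order as in the question
set.seed(1)
data <- data.frame(a = rep(1:4, times = 10), b = rnorm(40), c = rnorm(40))
data <- data[order(data$c, decreasing = TRUE), ]

# by() returns a list-like object holding one data frame per value of a
top5.by <- by(data = data, INDICES = data$a, FUN = head, 5)

# Bind the per-group pieces back into a single data frame
top5 <- do.call("rbind", top5.by)
table(top5$a)
```

Because the input is already sorted by `c`, `head` picks the 5 largest values of `c` within each group.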

A similar result can be achieved by splitting your data.frame on `a` and then using `lapply` to step through each element, keeping the first n rows of each.

split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
       a           b           c
1.21   1  0.11885516  1.63898947
1.41   1  1.01820329  1.48113594
1.61   1 -0.87958790  0.77840718
1.81   1  0.64857445  0.77346517
1.31   1  1.51022545  0.71079568
2.15   2 -1.09704040  1.17106930
2.85   2  0.42914795  0.88268205
2.65   2 -1.01480957  0.67367823
2.45   2 -0.07982711  0.36933837
2.35   2 -0.67643885 -0.21707668
Roman Luštrik
1

Here is my attempt:

library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)

    a           b        c
28  6  1.69611039 1.720081
91 11  1.62656460 1.651574
70  9 -1.17808386 1.641954
6  15  1.23420550 1.603140
23  7  0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10  2.83730734 1.522825
49 10  0.39313579 1.370831
80  9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18  6  0.08445888 1.152266
86 15  0.53027267 1.066034
69 10 -1.89077464 1.037447
62  1 -0.43599566 1.026505
3   7  0.78544009 1.014770

result <- ddply(data, .(a), "head", 5)
head(result, 15)

   a           b           c
1  1 -0.43599566  1.02650544
2  1 -1.55113486  0.36380251
3  1  0.68608364  0.30911430
4  1 -0.85406406  0.05555500
5  1 -1.83894595 -0.11850847
6  5 -1.79715809  0.77760033
7  5  0.82814909  0.22401278
8  5 -1.52726859  0.06745849
9  5  0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6  1.69611039  1.72008079
12 6  0.08445888  1.15226601
13 6 -1.99465060  0.82214319
14 6  0.43855489  0.76221979
15 6 -2.15251353  0.64417757
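
For comparison, the `data.table` route mentioned in the comments on the question could look roughly like this. It is a sketch only, assuming the data.table package is installed; the sample data and the names `dt` and `result.dt` are made up for illustration.

```r
library(data.table)

# Hypothetical sample data: 4 groups of 10 rows each,
# sorted by c in descending order
set.seed(1)
data <- data.frame(a = rep(1:4, times = 10), b = rnorm(40), c = rnorm(40))
data <- data[order(data$c, decreasing = TRUE), ]

dt <- as.data.table(data)

# .SD is the subset of rows for the current group; keep its first 5 rows
result.dt <- dt[, head(.SD, 5), by = a]
result.dt[, .N, by = a]
```

Groups appear in the result in the order in which they first occur in `dt`, and rows within each group keep their existing (descending `c`) order.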
ptocquin