
I have a dataset like this:

year | age | value
----------------------
2005 | 8   | 10
2005 | 8   | 12
2005 | 8   | 30
2005 | 8   | 12
2006 | 5   | 10
2006 | 5   | 20
2006 | 5   | 15
2006 | 5   | 20
2007 | 8   | 16
2007 | 8   | 20
2007 | 8   | 18
2007 | 5   | 20

The real dataset has about 50,000 rows.

What I want is a subset containing the highest 30% of the values within each subgroup, like this:

year | age | value
----------------------
 2005 | 8   | 30
 2006 | 5   | 20
 2006 | 5   | 20
 2007 | 8   | 20
 2007 | 5   | 20

That means: the rows with the highest 30% of values for each (year, age) combination: year 2005, age 8; year 2006, age 5; year 2007, age 8; year 2007, age 5; and so on.

AmBlack

2 Answers


Here is a pretty fast approach using `data.table`:

library(data.table)

dt <- as.data.table(read.table(header = TRUE, text = "year age value
2005 8 10
2005 8 12
2005 8 30
2005 8 12
2006 5 10
2006 5 20
2006 5 15
2006 5 20
2007 8 16
2007 8 20
2007 8 18
2007 5 20"))


# .I returns row numbers; the inner call collects, per (year, age) group, the
# indices of rows at or above the 70th percentile, and $V1 extracts them
dt[dt[, .I[value >= quantile(value, 0.7)], by = c("year", "age")]$V1]

   year age value
1: 2005   8    30
2: 2006   5    20
3: 2006   5    20
4: 2007   8    20
5: 2007   5    20
Mike H.
    `dt[,.SD[value >= quantile(value,0.7)], by = c("year","age")]` isn't this enough ? – user5249203 Jun 03 '16 at 20:47
  • Yes it is, I thought the approach in my answer was a faster way to do it, but I just tested it and I guess it isn't. – Mike H. Jun 03 '16 at 20:56
    Using `.I` is especially an advantage for large datasets. See [here](http://stackoverflow.com/a/34753260/2204410) for a comparison. – Jaap Jun 06 '16 at 07:46
The same filtering can be done with `dplyr`:
library(dplyr)

# assumes `df` holds the example data above as a data frame
df %>% group_by(year, age) %>% filter(value >= quantile(value, probs = 0.7))

Source: local data frame [5 x 3]
Groups: year, age [4]

   year   age value
  (dbl) (dbl) (int)
1  2005     8    30
2  2006     5    20
3  2006     5    20
4  2007     8    20
5  2007     5    20
Psidom
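For completeness, the same top-30% filter can also be written in base R. This is a sketch using `ave()` to compute each group's 70th percentile; the inline data frame mirrors the example data above.

```r
# Build the example data as a plain data frame
df <- data.frame(
  year  = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007),
  age   = c(8, 8, 8, 8, 5, 5, 5, 5, 8, 8, 8, 5),
  value = c(10, 12, 30, 12, 10, 20, 15, 20, 16, 20, 18, 20)
)

# ave() evaluates quantile() once per (year, age) group and broadcasts the
# result back to every row of that group
cutoff <- ave(df$value, df$year, df$age, FUN = function(x) quantile(x, 0.7))

# keep rows at or above their group's 70th percentile
df[df$value >= cutoff, ]
```

This keeps the same five rows shown in the answers above, with no extra packages required.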