6

To clarify the question I'll briefly describe the data.

Each row in the data.frame is an observation, and the columns represent variables pertinent to that observation including: what individual was observed, when it was observed, where it was observed, etc. I want to exclude/filter individuals for which there are fewer than 5 observations.

In other words, if there are fewer than 5 rows where individual = x, then I want to remove all rows that contain individual x and reassign the result to a new data.frame. I'm aware of some brute force techniques using something like names == unique(df$individualname) and then subsetting out those names individually and applying nrow to determine whether or not to exclude them...but there has to be a better way. Any help is appreciated, I'm still pretty new to R.

ekad
  • 14,436
  • 26
  • 44
  • 46
Andrew Bade
  • 333
  • 3
  • 10
  • 1
    Welcome to SO! You should produce a [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) example. – Vincent Bonhomme May 21 '16 at 21:00
  • Thank you! I'll try to include reproducible examples in future questions, but here your response was sufficient without needing one. The actual data is quite unwieldy with over 30 columns and 5,000,000 rows. I see how I could use an example like the one you created to get the point across more clearly in the future, though. – Andrew Bade May 21 '16 at 21:17
  • great ;-). a dummy example is even better (sensu more universal) than a specific one – Vincent Bonhomme May 21 '16 at 21:18
  • The data should be in *this* question, as well as future ones. – Rich Scriven May 21 '16 at 21:34

4 Answers4

9

An example using group_by and filter from dplyr package:

library(dplyr)
df <- data.frame(id=c(rep("a", 2), rep("b", 5), rep("c", 8)),
           foo=runif(15))

> df
id       foo
1   a 0.8717067
2   a 0.9086262
3   b 0.9962453
4   b 0.8980123
5   b 0.1535324
6   b 0.2802848
7   b 0.9366375
8   c 0.8109557
9   c 0.6945285
10  c 0.1012925
11  c 0.6822955
12  c 0.3757085
13  c 0.7348635
14  c 0.3026395
15  c 0.9707223

df %>% group_by(id) %>% filter(n()>= 5) %>% ungroup()
Source: local data frame [13 x 2]

id       foo
(fctr)     (dbl)
1       b 0.9962453
2       b 0.8980123
3       b 0.1535324
4       b 0.2802848
5       b 0.9366375
6       c 0.8109557
7       c 0.6945285
8       c 0.1012925
9       c 0.6822955
10      c 0.3757085
11      c 0.7348635
12      c 0.3026395
13      c 0.9707223

or with base R:

> df[df$id %in% names(which(table(df$id)>=5)), ]
id       foo
3   b 0.9962453
4   b 0.8980123
5   b 0.1535324
6   b 0.2802848
7   b 0.9366375
8   c 0.8109557
9   c 0.6945285
10  c 0.1012925
11  c 0.6822955
12  c 0.3757085
13  c 0.7348635
14  c 0.3026395
15  c 0.9707223

Still in base R, using with is a more elegant way to do the very same thing:

df[with(df, id %in% names(which(table(id)>=5))), ]

or:

subset(df, with(df, id %in% names(which(table(id)>=5))))
Vincent Bonhomme
  • 7,235
  • 2
  • 27
  • 38
2

Another way to do the same thing using the data.table package.

library(data.table)
set.seed(1)
dt <- data.table(id=sample(1:4,20,replace=TRUE),var=sample(1:100,20))

dt1<-dt[,count:=.N,by=id][(count>=5)]
dt2<-dt[,count:=.N,by=id][(count<5)]

dt1
    id var count
 1:  2  94     5
 2:  2  22     5
 3:  3  64     5
 4:  4  13     6
 5:  4  37     6
 6:  4   2     6
 7:  3  36     5
 8:  3  81     5
 9:  3  90     5
10:  2  17     5
11:  4  72     6
12:  2  57     5
13:  3  67     5
14:  4   9     6
15:  2  60     5
16:  4  34     6


dt2
   id var count
1:  1  26     4
2:  1  31     4
3:  1  44     4
4:  1  54     4
Mike H.
  • 13,960
  • 2
  • 29
  • 39
2

It can be also with data.table using a logical condition with if after grouping by 'id'

library(data.table)
setDT(df)[, if(.N >=5) .SD, id]
#  id       foo
# 1:  b 0.9962453
# 2:  b 0.8980123
# 3:  b 0.1535324
# 4:  b 0.2802848
# 5:  b 0.9366375
# 6:  c 0.8109557
# 7:  c 0.6945285
# 8:  c 0.1012925
# 9:  c 0.6822955
#10:  c 0.3757085
#11:  c 0.7348635
#12:  c 0.3026395
#13:  c 0.9707223

data

df <- structure(list(id = c("a", "a", "b", "b", "b", "b", "b", "c", 
"c", "c", "c", "c", "c", "c", "c"), foo = c(0.8717067, 0.9086262, 
0.9962453, 0.8980123, 0.1535324, 0.2802848, 0.9366375, 0.8109557, 
0.6945285, 0.1012925, 0.6822955, 0.3757085, 0.7348635, 0.3026395, 
0.9707223)), .Names = c("id", "foo"), class = "data.frame", 
row.names = c(NA, -15L))
akrun
  • 874,273
  • 37
  • 540
  • 662
0

you can also use table. take for instance the data.frame mtcars

table(mtcars$cyl)

you will see that cyl has 3 values 4 6 8. there are 7 cars with 6 cylinders and if you want to exclude observations with less than 10 then you can exclude the cars with 6 cylinders like that

mtcars[!mtcars$cyl%in%names(table(mtcars$cyl)[table(mtcars$cyl)<10]),]

this will exclude observations using %in% names and table alone