How to drop groups when there are not enough observations?
In the following reproducible example, each person (identified by name
) has 10 observations:
install.packages('randomNames') # install package if required
install.packages('data.table') # install package if required
lapply(c('data.table', 'randomNames'), require, character.only = TRUE) # load packages
set.seed(1)
testDT <- data.table( date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"),10),
name = rep(randomNames(10, which.names='first'), times=1, each=10),
Y = runif(100, 5, 15),
X = rnorm(100, 2, 9),
testDT <- testDT[ X > 0]
Now I want to keep only the persons with at least 6 observations, so Gracelline, Anna, Aesha and Michael must be removed, because they have only 3, 2, 4 and 5 observations respectively.
testDT[, length(X), by=name]
name V1
1: Blake 6
2: Alexander 6
3: Leigha 8
4: Gracelline 3
5: Epifanio 7
6: Keasha 6
7: Robyn 6
8: Anna 2
9: Aesha 4
10: Michael 5
How do I do this in an automatic way (real dataset is much larger)?
Edit:
Yes it's a duplicate. :( The last proposed method was the fastest one.
> system.time(testDT[, .SD[.N>=6], by = name])
user system elapsed
0.293 0.227 0.517
> system.time(testDT[testDT[, .I[.N>=6], by = name]$V1])
user system elapsed
0.163 0.243 0.415
> system.time(testDT[,if(.N>=6) .SD , by = name])
user system elapsed
0.073 0.323 0.399