How to drop groups when there are not enough observations?

Question

How to drop groups when there are not enough observations? In the following reproducible example, each person (identified by name) has 10 observations:

install.packages('randomNames') # install package if required
install.packages('data.table')  # install package if required
lapply(c('data.table', 'randomNames'), require, character.only = TRUE) # load packages

set.seed(1)
testDT <- data.table( date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"),10),
                      name = rep(randomNames(10, which.names='first'), times=1, each=10),
                      Y    =  runif(100, 5, 15),
                      X    =  rnorm(100, 2, 9),
testDT <- testDT[ X > 0]

Now I want to keep only the persons with at least 6 observations, so Gracelline, Anna, Aesha and Michael must be removed, because they have only 3, 2, 4 and 5 observations respectively.

  testDT[, length(X), by=name]
            name V1
   1:      Blake  6
   2:  Alexander  6
   3:     Leigha  8
   4: Gracelline  3
   5:   Epifanio  7
   6:     Keasha  6
   7:      Robyn  6
   8:       Anna  2
   9:      Aesha  4
  10:    Michael  5

How do I do this in an automatic way (real dataset is much larger)?

Edit:

Yes it's a duplicate. :( The last proposed method was the fastest one.

> system.time(testDT[, .SD[.N>=6], by = name])
   user  system elapsed 
  0.293   0.227   0.517 
> system.time(testDT[testDT[, .I[.N>=6], by = name]$V1])
   user  system elapsed 
  0.163   0.243   0.415 
> system.time(testDT[,if(.N>=6) .SD , by = name])
   user  system elapsed 
  0.073   0.323   0.399

akrun · Accepted Answer · 2016-01-16T18:15:32.907

We group by 'name', get the nrow (.N), and if it is greater than 6, we Subset the Data.table (.SD).

testDT[,if(.N>=6) .SD , by = name]
#       name       date         Y           X
# 1:     Blake 2010-01-01  9.820801  3.69913070
# 2:     Blake 2012-01-01  9.935413 15.18999375
# 3:     Blake 2013-01-01  6.862176  3.37928004
# 4:     Blake 2014-01-01 13.273733 21.55350503
# 5:     Blake 2015-01-01 11.684667  6.27958576
# 6:     Blake 2017-01-01  6.079436  7.49653718
# 7: Alexander 2010-01-01 13.209463  4.62301612
# 8: Alexander 2012-01-01 12.829328  2.00994816
# 9: Alexander 2013-01-01 10.530363  2.66907192
#10: Alexander 2016-01-01  5.233312  0.78339246
#11: Alexander 2017-01-01  9.772301 12.60278297
#12: Alexander 2019-01-01 11.927316  7.34551569
#13:    Leigha 2010-01-01  9.776196  4.99655334
#14:    Leigha 2011-01-01 13.612095 11.56789854
#15:    Leigha 2013-01-01  7.447973  5.33016929
#16:    Leigha 2014-01-01  5.706790  4.40388912
#17:    Leigha 2016-01-01  8.162717 12.87081025
#18:    Leigha 2017-01-01 10.186343 12.44362354
#19:    Leigha 2018-01-01 11.620051  8.30192285
#20:    Leigha 2019-01-01  9.068302 16.28150109
#21:  Epifanio 2010-01-01  8.390729 17.90558542
#22:  Epifanio 2011-01-01 13.394404  8.45036728
#23:  Epifanio 2012-01-01  8.466835 10.19156807
#24:  Epifanio 2013-01-01  8.337749  5.45766822
#25:  Epifanio 2014-01-01  9.763512 17.13958472
#26:  Epifanio 2017-01-01  8.899895 14.89054015
#27:  Epifanio 2019-01-01 14.606180  0.13357331
#28:    Keasha 2013-01-01  8.253522  6.44769498
#29:    Keasha 2014-01-01 12.570871  0.40402566
#30:    Keasha 2016-01-01 12.111212 14.08734943
#31:    Keasha 2017-01-01  6.216919  0.06878532
#32:    Keasha 2018-01-01  7.454885  0.38399123
#33:    Keasha 2019-01-01  6.433044  1.09828333
#34:     Robyn 2010-01-01  7.396294  8.41399676
#35:     Robyn 2011-01-01  5.589344  1.33792036
#36:     Robyn 2012-01-01 11.422883  1.66129246
#37:     Robyn 2015-01-01 12.973088  2.54144396
#38:     Robyn 2017-01-01  9.100841  6.78346573
#39:     Robyn 2019-01-01 11.049333  4.75902075

Or instead of if, we can directly use .N>1 and wrap with `.SD

testDT[, .SD[.N>=6], by = name]

it could be a little slow, so another option would be .I to get the row index and then subset

testDT[testDT[, .I[.N>=6], by = name]$V1]

How to drop groups when there are not enough observations?

1 Answers1