0

How to drop groups when there are not enough observations? In the following reproducible example, each person (identified by name) has 10 observations:

install.packages('randomNames') # install package if required
install.packages('data.table')  # install package if required
lapply(c('data.table', 'randomNames'), require, character.only = TRUE) # load packages

set.seed(1)
testDT <- data.table( date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"),10),
                      name = rep(randomNames(10, which.names='first'), times=1, each=10),
                      Y    =  runif(100, 5, 15),
                      X    =  rnorm(100, 2, 9),
testDT <- testDT[ X > 0]

Now I want to keep only the persons with at least 6 observations, so Gracelline, Anna, Aesha and Michael must be removed, because they have only 3, 2, 4 and 5 observations respectively.

  testDT[, length(X), by=name]
            name V1
   1:      Blake  6
   2:  Alexander  6
   3:     Leigha  8
   4: Gracelline  3
   5:   Epifanio  7
   6:     Keasha  6
   7:      Robyn  6
   8:       Anna  2
   9:      Aesha  4
  10:    Michael  5

How do I do this in an automatic way (real dataset is much larger)?

Edit:

Yes it's a duplicate. :( The last proposed method was the fastest one.

> system.time(testDT[, .SD[.N>=6], by = name])
   user  system elapsed 
  0.293   0.227   0.517 
> system.time(testDT[testDT[, .I[.N>=6], by = name]$V1])
   user  system elapsed 
  0.163   0.243   0.415 
> system.time(testDT[,if(.N>=6) .SD , by = name])
   user  system elapsed 
  0.073   0.323   0.399 
Konstantinos
  • 4,096
  • 3
  • 19
  • 28

1 Answers1

1

We group by 'name', get the nrow (.N), and if it is greater than 6, we Subset the Data.table (.SD).

testDT[,if(.N>=6) .SD , by = name]
#       name       date         Y           X
# 1:     Blake 2010-01-01  9.820801  3.69913070
# 2:     Blake 2012-01-01  9.935413 15.18999375
# 3:     Blake 2013-01-01  6.862176  3.37928004
# 4:     Blake 2014-01-01 13.273733 21.55350503
# 5:     Blake 2015-01-01 11.684667  6.27958576
# 6:     Blake 2017-01-01  6.079436  7.49653718
# 7: Alexander 2010-01-01 13.209463  4.62301612
# 8: Alexander 2012-01-01 12.829328  2.00994816
# 9: Alexander 2013-01-01 10.530363  2.66907192
#10: Alexander 2016-01-01  5.233312  0.78339246
#11: Alexander 2017-01-01  9.772301 12.60278297
#12: Alexander 2019-01-01 11.927316  7.34551569
#13:    Leigha 2010-01-01  9.776196  4.99655334
#14:    Leigha 2011-01-01 13.612095 11.56789854
#15:    Leigha 2013-01-01  7.447973  5.33016929
#16:    Leigha 2014-01-01  5.706790  4.40388912
#17:    Leigha 2016-01-01  8.162717 12.87081025
#18:    Leigha 2017-01-01 10.186343 12.44362354
#19:    Leigha 2018-01-01 11.620051  8.30192285
#20:    Leigha 2019-01-01  9.068302 16.28150109
#21:  Epifanio 2010-01-01  8.390729 17.90558542
#22:  Epifanio 2011-01-01 13.394404  8.45036728
#23:  Epifanio 2012-01-01  8.466835 10.19156807
#24:  Epifanio 2013-01-01  8.337749  5.45766822
#25:  Epifanio 2014-01-01  9.763512 17.13958472
#26:  Epifanio 2017-01-01  8.899895 14.89054015
#27:  Epifanio 2019-01-01 14.606180  0.13357331
#28:    Keasha 2013-01-01  8.253522  6.44769498
#29:    Keasha 2014-01-01 12.570871  0.40402566
#30:    Keasha 2016-01-01 12.111212 14.08734943
#31:    Keasha 2017-01-01  6.216919  0.06878532
#32:    Keasha 2018-01-01  7.454885  0.38399123
#33:    Keasha 2019-01-01  6.433044  1.09828333
#34:     Robyn 2010-01-01  7.396294  8.41399676
#35:     Robyn 2011-01-01  5.589344  1.33792036
#36:     Robyn 2012-01-01 11.422883  1.66129246
#37:     Robyn 2015-01-01 12.973088  2.54144396
#38:     Robyn 2017-01-01  9.100841  6.78346573
#39:     Robyn 2019-01-01 11.049333  4.75902075

Or instead of if, we can directly use .N>1 and wrap with `.SD

testDT[, .SD[.N>=6], by = name]

it could be a little slow, so another option would be .I to get the row index and then subset

testDT[testDT[, .I[.N>=6], by = name]$V1]
akrun
  • 874,273
  • 37
  • 540
  • 662