I've been trying the different solutions with microbenchmark:
prueba <- data.table(id = rep(c(1,1,1,1,2,2,3,3,4), 1000000), kk = rep(c("FA","N","N","N",NA,"FA","N","FA","N"), 1000000), rrr = rep(1:9, 1000000))
prueba[, if(any(kk == "FA")) .SD, by= id] # docendo
prueba[id %in% unique(prueba[kk == "FA", id])] # lmo
prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id,] # eddi
prueba[id %in% prueba[, any(kk == "FA", na.rm = TRUE), by = id]$id[
  prueba[, any(kk == "FA", na.rm = TRUE), by = id]$V1]] # skan
prueba %>% group_by(id) %>% filter('FA'%in%kk) # Andrew
prueba[prueba[kk == "FA", .(id)], on="id"] # lmo
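All of these ran against the same prueba table. For reference, the harness was roughly the following; I didn't save the exact call, so times = 5 and the labels are reconstructions on my part:

library(data.table)
library(dplyr)
library(microbenchmark)

microbenchmark(
  docendo = prueba[, if (any(kk == "FA")) .SD, by = id],
  lmo     = prueba[id %in% unique(prueba[kk == "FA", id])],
  eddi    = prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id],
  skan    = {
    # grouped flag per id, then keep the ids flagged TRUE
    v <- prueba[, any(kk == "FA", na.rm = TRUE), by = id]
    prueba[id %in% v$id[v$V1]]
  },
  Andrew  = prueba %>% group_by(id) %>% filter("FA" %in% kk),
  times = 5
)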
The results were:
     min       lq     mean   median       uq      max    name
2.206436 2.211022 2.258038 2.215607 2.283839 2.352071 docendo
1.456590 1.472334 1.596654 1.488077 1.666687 1.845296     lmo
2.767113 2.869260 2.953024 2.971408 3.045980 3.120552    eddi
3.431671 3.437914 3.451760 3.444157 3.461804 3.479451    skan
2.088516 2.247807 2.313196 2.407098 2.425535 2.443973  Andrew
The last solution (lmo's join) doesn't work; it fails with:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, : Join results in more than 2^31 rows
(internal vecseq reached physical limit). Very likely misspecified
join. Check for duplicate key values in i each of which join to the
same group in x over and over again. If that's ok, try by=.EACHI to
run j for each group to avoid the large allocation.
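Deduplicating the ids in i before the join avoids the blow-up. This is my own workaround rather than lmo's original code:

# each id now appears once in i, so every matching row of prueba
# is returned exactly once instead of once per duplicate in i
prueba[unique(prueba[kk == "FA", .(id)]), on = "id"]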
I expected to see a much bigger difference between the methods; maybe a different dataset would show one.
The fastest method so far seems to be:
prueba[id %in% unique(prueba[kk == "FA", id])]
I guess there must be better options using .I, .GRP, or similar special symbols.
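For example, something along these lines might work (an untested sketch on my part): compute the row indices of every qualifying group in one grouped pass with .I, then subset once.

# .I[any(...)] yields all of a group's row numbers when the condition
# holds (a single TRUE subscript recycles) and integer(0) otherwise
prueba[prueba[, .I[any(kk == "FA", na.rm = TRUE)], by = id]$V1]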