Say we have this toy data.table

prueba  <- data.table(id=c(1,1,1,1,2,2,3,3,4), kk=c("FA", "N","N","N",NA,"FA","N", "FA", "N"), rrr=1:9)

id kk rrr
1 FA   1
1  N   2
1  N   3
1  N   4
2 NA   5
2 FA   6
3  N   7
3 FA   8
4  N   9

And we want to retrieve all rows for a given "id" if that id contains any "FA" value in the kk column.

I've managed to do it this way:

prueba[id %in% prueba[,any(kk=="FA", na.rm=T),
   by=id]$id[prueba[,any(kk=="FA", na.rm=T),by=id]$V1],]

id kk rrr
1 FA   1
1  N   2
1  N   3
1  N   4
2 NA   5
2 FA   6
3  N   7
3 FA   8

(We get all rows with id=1,2 and 3).

But I think it's too long and not optimized.

How would you do it easily with data.table?

skan
    With a join, you could do `prueba[prueba[kk == "FA", .(id)], on="id"]` or use `unique` if there are multiple instances of FA per ID: `prueba[unique(prueba[kk == "FA", .(id)]), on="id"]`. – lmo Jun 27 '17 at 14:40
    you can also use `prueba[, if(any(kk == "FA")) .SD, by= id]` – talat Jun 27 '17 at 14:43
  • Also `prueba[id %in% unique(prueba[kk == "FA", id])]` will work. – lmo Jun 27 '17 at 14:45
  • @docendodiscimus I find your solution is the simplest one. If you post it as an answer I will select it. – skan Jun 27 '17 at 14:53
  • @skan, please feel free to post it yourself, perhaps together with other suggestions in comments – talat Jun 27 '17 at 15:14
  • this question is a special case of https://stackoverflow.com/questions/16573995/subset-by-group-with-data-table/16574176 – eddi Jun 27 '17 at 20:03
  • @eddi how would you adapt that solution here? prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id,] ? – skan Jun 27 '17 at 20:53
  • `prueba[, .SD[any(kk == 'FA')], by = id]` or `prueba[prueba[, .I[any(kk == 'FA')], by = id]$V1]` – eddi Jun 27 '17 at 21:10

3 Answers


I'm not sure about optimized, but cleaned up and using dplyr:

library(dplyr)
prueba %>% 
    group_by(id) %>% 
    filter('FA' %in% kk)

# A tibble: 8 x 3
# Groups:   id [3]
     id    kk   rrr
  <dbl> <chr> <int>
1     1    FA     1
2     1     N     2
3     1     N     3
4     1     N     4
5     2  <NA>     5
6     2    FA     6
7     3     N     7
8     3    FA     8
Andrew Taylor
    `dplyr::flter` should be pretty fast in my experience – Gregor de Cillia Jun 27 '17 at 14:43
  • I assume it should be fast. But I never get into the speed/optimization conversations as my data almost never requires that focus. So I'm not going to make an argument that I can't back up. – Andrew Taylor Jun 27 '17 at 14:44
    question was about `data.table`, not `dplyr` – mtoto Jun 27 '17 at 14:45
  • Yeah, I missed that somehow. I mean, I noticed data.table had been used to build and mutate the data, but I didn't notice the tag or it in the title. And I've had plenty of coffee. I suppose I'll just delete this then. – Andrew Taylor Jun 27 '17 at 14:48

For a data.table case I would simplify your code to:

prueba  <- data.table(id=c(1,1,1,1,2,2,3,3,4), kk=c("FA", "N","N","N",NA,"FA","N", "FA", "N"), rrr=1:9)  

prueba[id %in% unique(prueba[kk=="FA",id])]

The output is:

   id kk rrr
1:  1 FA   1
2:  1  N   2
3:  1  N   3
4:  1  N   4
5:  2 NA   5
6:  2 FA   6
7:  3  N   7
8:  3 FA   8 
NpT

I've been trying the different solutions with microbenchmark:

prueba <- data.table(id=rep(c(1,1,1,1,2,2,3,3,4), 1000000), kk=rep(c("FA", "N","N","N",NA,"FA","N", "FA", "N"), 1000000), rrr=rep(1:9, 1000000))

prueba[, if(any(kk == "FA")) .SD, by= id]               # docendo
prueba[id %in% unique(prueba[kk == "FA", id])]          # lmo
prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id,]   # eddi
prueba[id %in% prueba[,any(kk=="FA", na.rm=T),by=id]
   $id[prueba[,any(kk=="FA", na.rm=T),by=id]$V1],]      # skan
prueba %>%   group_by(id) %>%   filter('FA'%in%kk)      # Andrew
prueba[prueba[kk == "FA", .(id)], on="id"]              # lmo
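The harness itself isn't shown above; a sketch of how the timings could be produced with `microbenchmark` (the `times = 10` setting and the smaller replication factor `n` are assumptions here, chosen so the example runs quickly; the benchmark in the text used `rep(..., 1000000)`):

```r
library(data.table)
library(dplyr)
library(microbenchmark)

n <- 100000  # the benchmark above used 1000000 replications
prueba <- data.table(id  = rep(c(1, 1, 1, 1, 2, 2, 3, 3, 4), n),
                     kk  = rep(c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"), n),
                     rrr = rep(1:9, n))

microbenchmark(
  docendo = prueba[, if (any(kk == "FA")) .SD, by = id],
  lmo     = prueba[id %in% unique(prueba[kk == "FA", id])],
  eddi    = prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id, ],
  Andrew  = prueba %>% group_by(id) %>% filter("FA" %in% kk),
  times = 10
)
```

Each expression is evaluated `times` times and `microbenchmark` reports the min/lq/mean/median/uq/max columns shown in the table below.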


min       lq       mean     median       uq     max    name
2.206436 2.211022 2.258038 2.215607 2.283839 2.352071   docendo
1.456590 1.472334 1.596654 1.488077 1.666687 1.845296   lmo
2.767113 2.869260 2.953024 2.971408 3.045980 3.120552   eddi
3.431671 3.437914 3.451760 3.444157 3.461804 3.479451   skan
2.088516 2.247807 2.313196 2.407098 2.425535 2.443973   Andrew

The last solution by lmo (the join) doesn't work on this dataset; it fails with:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.

I expected to see a much bigger difference between the methods; maybe a different dataset would show one. The fastest method so far seems to be:

prueba[id %in% unique(prueba[kk == "FA", id])] 

I guess there must be better options using .I, .GRP or such functions.
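One such `.I`-based option was already suggested by eddi in the comments; on the small toy table it can be sketched as:

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# .I holds the global row numbers; for each id group, keep all of its
# row numbers when any kk in that group equals "FA", else none.
idx <- prueba[, .I[any(kk == "FA")], by = id]$V1
prueba[idx]
# rows with id = 1, 2 and 3 (rrr 1 through 8)
```

This avoids the `%in%` lookup by indexing directly with the collected row numbers.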

skan