3

I recently started to use the data.table package to identify values in a table's column that conform to some conditions. Although and I manage to get most of the things done, now I'm stuck with this problem:

I have a data table, table1, in which the first column (labels) is a group ID, and the second column, o.cell, is an integer. The key is on "labels"

I have another data table, table2, containing a single column: "cell".

Now, I'm trying to find, for each group in table1, the values from the column "o.cell" that are in the "cell" column of table2. table1 has some 400K rows divided into 800+ groups of unequal sizes. table2 has about 1.3M rows of unique cell numbers. Cell numbers in column "o.cell" table1 can be found in more than one group.

This seems like a simple task but I can't find the right way to do it. Depending on the way I structure my call, it either gives me a different result than what I expect or it never completes and I have to end R task because it's frozen (my machine has 24 GB RAM).

Here's an example of one of the "variant" of the calls I have tried:

overlap <- table1[, list(over.cell =
              o.cell[!is.na(o.cell) & o.cell %in% table2$cell]),
              by = labels]

I pretty sure this is the wrong way to use data tables for this task and on top of that I can't get the result I want.

I will greatly appreciate any help. Thanks.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
Guilôme
  • 177
  • 2
  • 11
  • 5
    Hello and welcome to SO. To help make a reproducible example, you can use `reproduce()` -- The function is `data.table` aware & Instructions are here: http://bit.ly/SORepro - [How to make a great R reproducible example](http://bit.ly/SORepro) – Ricardo Saporta Oct 29 '13 at 20:38
  • Did you run some test on a (much) smaller case to check if your code is correct ? It's very hard to say what happens when you perform operations on 400k rows... – digEmAll Oct 29 '13 at 20:45
  • 1
    Shot in the dark: just tag the columns before doing your grouping business: `setkey(table1,o.cell); table1[J(table2$cell),isin2:=TRUE]`. Then you can do `table2[isin2,{do something useful},by=labels]`. And if your tables are large, you probably want to avoid creating duplicated information with `<-` – Frank Oct 29 '13 at 21:11
  • 1
    @Frank, your suggestion does exactly what I want, thanks! – Guilôme Oct 30 '13 at 18:25
  • 1
    @RicardoSaporta: thanks for the link to the reproduce function. I had never heard about it. It's great and I will sure use it in my future posts. – Guilôme Oct 30 '13 at 18:27
  • @user2933566 Glad it helped! I've elaborated a bit in an answer below. – Frank Oct 30 '13 at 19:12
  • @user2933566, no prob. Also, don't forget to upvote the helpful answers – Ricardo Saporta Oct 30 '13 at 19:24

1 Answers1

1

Sounds like this is your set up:

dt1 = data.table(labels = c('a','b'), o.cell = 1:10)
dt2 = data.table(cell = 4:7)

And you simply want to do a simple merge:

setkey(dt1, o.cell)
dt1[dt2]
#   o.cell labels
#1:      4      b
#2:      5      a
#3:      6      b
#4:      7      a
eddi
  • 49,088
  • 6
  • 104
  • 155