Counting existing pairs of data in large data table

Question

This question is based on the response given by @Arun here. In the response, @Arun proposes a clever way to avoid creating sparse matrices by simply only looking at occurring pairs, hence avoiding the saving of many zeros and the doubling of pairs A-B and B-A.

The following is copy-pasted from his answer:

Step 1: Construct sample data of your dimensions approximately:

require(data.table) ## 1.9.4+
set.seed(1L)        ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)

Step 2: Pre-processing:

DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT)    ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id)  ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I]

Step 3: Solution:

ans = DT[DT, {
              tmp = runlen < i.runlen; 
              list(motif[tmp], i.motif[any(tmp)])
             }, 
      by=.EACHI][, .N, by="V1,V2"]

Running this works fine provided you have enough memory on your computer. I also humbly admit I have no ideawhat exactly the code is doing to create the wanted results so I'm just looking at input and output, agnostic of the process. When applying the exact same code to my data, what seems to happen is that pairs appear that are not in the original data.

I'm running the following code which is a slightly adapted version of what @Arun had provided. The adaptation is because I need to run the code for 17 different blocks. I.e. I'm looking for which pairs occur within a specific block.

 cooc <- data.frame()
for(j in 1:17){
  DT <- dt[block == j,c("pid", "cid"), with =F]
  DT$pid <- as.factor(DT$pid)
  setorder(DT)
  setkey(DT,pid)
  DT[, runlen := .I]
                 ans <- DT[DT, {
                            tmp = runlen < i.runlen;
                            list(cid[tmp],i.cid[any(tmp)])
                            },
                        by= .EACHI][, .N, by="V1,V2"]
                              ans$block <- j 
                    cooc <- data.table(rbind(cooc,ans))
                      rm(ans)
                }

For as far as I understand the code, it's all identical, just looped with for to do the same thing for 17 blocks. both pid and cid are just integers that identify a variable of interest.

For j = 1 , the following goes:

DT[cid == 39] # cid is my equivalent of motif above and pid is my equivalent of id above
pid cid runlen
20319 39 3614

This suggests there is only one pid for which cid equals 39

Now when I look into the resulting ans datatable I get the following:

ans[V1 == 39]
    V1    V2 N block
 1: 39    41 1     1
 2: 39    42 1     1
 3: 39    44 1     1
 4: 39    47 1     1
 5: 39  7027 1     1
 6: 39  7043 1     1
 7: 39  7174 1     1
 8: 39  9434 1     1
 9: 39 11493 1     1
10: 39 18815 1     1
11: 39 18875 1     1
12: 39 18896 1     1
13: 39 18909 1     1
14: 39 18924 1     1
15: 39 18928 1     1
16: 39 18929 1     1
17: 39 18931 1     1
18: 39 18932 1     1
19: 39 19265 1     1
20: 39 19410 1     1

Suddenly, there are 20 occurrences of V1 (if I understand the code correctly, this is the equivalent of what was cid). Yet in DT there is only 1 pid assigned to cid.

I have no idea how to reproduce this finding so I tried to show what seems to be inconsistent. I don't think the original code has this problem so I'm hoping someone can explain where the additional occurrences of cid == 39 come from, based on the info I have given here.

In my opinion, you should try to understand Arun's code before extending it or coming here to ask a question about your extension... (I'm not saying I understand it either.) Also, the example seems to fall short of reproducibility once you introduce this `dt` object with a `block` variable (that I don't see created anywhere). — Frank, Sep 02 '15 at 16:08
Glad that you're trying the data.table solution. But have you seen [this comment](http://stackoverflow.com/questions/26244685/count-every-possible-pair-of-values-in-a-column-grouped-by-multiple-columns?lq=1#comment41199789_26246588)? I'll write back about the data.table answer after I've had a look. — Arun, Sep 02 '15 at 17:37
@Frank, I agree the example is not reproducible because I cannot explain where the anomaly comes from. I did my best. Also the code extension in itself is actually irrelevant because is not the cause of the problem. Even when taking the loop away and merely looking at `j = 1` as I did in the example I give the problem persists.But when thinking about it, I just realized where I went wrong. Sorry, consider question solved. — SJDS, Sep 03 '15 at 03:13

score 1 · Accepted Answer · answered Sep 02 '15 at 18:41

Two things:

First, I don't understand what's wrong with the result you get. Starting from

require(data.table)
set.seed(1L)
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)

let me recreate the data that helps answer your Q.

# keep only one of 'motif_456'
DT2 = rbind(DT[1L], DT[motif != "motif_456"])
DT2[1L]
#       id     motif
# 1: 49338 motif_456

DT2[ , .N, by=motif]
#            motif    N
#    1:  motif_456    1
#    2:  motif_639 1637
#    3:  motif_984 1649
#    4: motif_1559 1531
#    5:  motif_347 1603
#   ---                
# 1712:   motif_46 1623
# 1713:  motif_521 1613
# 1714:  motif_803 1568
# 1715:  motif_603 1573
# 1716:  motif_461 1591

Let's check all motifs corresponding to id = 49338:

DT2[id == 49338, motif]
#  [1] "motif_456"  "motif_553"  "motif_1048" "motif_1680" "motif_171"  "motif_1706"
#  [7] "motif_707"  "motif_163"  "motif_489"  "motif_107"  "motif_1419" "motif_595" 
# [13] "motif_1223" "motif_1274" "motif_1164" "motif_427"  "motif_505"  "motif_1041"
# [19] "motif_1321" "motif_1231" "motif_1498" "motif_837"  "motif_298"  "motif_649" 
# [25] "motif_631"

So obviously for all these motifs' combination with motif_456 the result should be 1. And that's what the data.table solution provides. Here's the relevant result after running data.table solution:

# data.table solution takes 11.2 secs
ans[V1 == "motif_456", .N] + ans[V2 == "motif_456", .N]
# [1] 24

Second, while the data.table answer does well, we can do this more efficiently with the solution shown by @nograpes. Let's try it on DT2:

require(Matrix)
DT2[, names(DT2) := lapply(.SD, as.factor)]

s <- sparseMatrix(
      as.integer(DT2$id), 
      as.integer(DT2$motif),
      dimnames = list(levels(DT2$id),levels(DT2$motif)),
  x = TRUE)

co.oc <- t(s) %*% s # Find co-occurrences.
tab <- summary(co.oc) # Create triplet representation.
tab <- tab[tab$i < tab$j,] # Extract upper triangle of matrix

ans = setDT(list(motif1 = levels(DT2$motif)[tab$i],
                  motif2 = levels(DT2$motif)[tab$j],
                  number = tab$x))

# Matrix solution takes 2.4 secs
ans[motif1 == "motif_456", .N] + ans[motif2 == "motif_456", .N]
# [1] 24

thanks for the detailed feedback. I'm not sure it solves my problem however. You are looking at all motifs with a specific id, while I am trying to figure out why a motif (created in V1) that only occurs once in the original dataset (`dt`) suddenly appears paired with 19 other motifs (in V2). I've looked into the sparse matrix solution as well (I asked another question about that) and the trouble is there that it requires transforming the variables to factors, as the dimensions are determined by the product of the highest variable values. — SJDS, Sep 03 '15 at 03:10
Hi @Arun , I was just looking over the solution again and you are right. The answer is correct. I can't believe I missed this yesterday, I guess spending 15hours in the office can do that to a person :( . Anyway thanks heaps for your explanation. It suddenly became all clear ! — SJDS, Sep 03 '15 at 03:14

Counting existing pairs of data in large data table

1 Answers1