This question is based on the response given by @Arun here. In the response, @Arun proposes a clever way to avoid creating sparse matrices: look only at the pairs that actually occur, which avoids storing many zeros and counting each pair twice (as A-B and as B-A).
The following is copy-pasted from his answer:
Step 1: Construct sample data of your dimensions approximately:
require(data.table) ## 1.9.4+
set.seed(1L) ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)
Step 2: Pre-processing:
DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT) ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id) ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I]
Step 3: Solution:
ans = DT[DT, {
          tmp = runlen < i.runlen;
          list(motif[tmp], i.motif[any(tmp)])
      },
      by=.EACHI][, .N, by="V1,V2"]
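As a side illustration that is not part of the quoted answer, here is a minimal sketch of the same three steps on a tiny, made-up table, which may help in reading the output. The names toy and pairs and the data themselves are hypothetical; only the data.table calls are taken from the code above.

```r
library(data.table)

# Hypothetical toy data: id 1 has motifs a, b, c; id 2 has motifs a, b
toy <- data.table(id    = c(1L, 1L, 1L, 2L, 2L),
                  motif = c("a", "b", "c", "a", "b"))

toy <- unique(toy)      # no duplicate motifs within the same id
setorder(toy)           # sort by id, then motif
setkey(toy, id)         # key on id for the self-join
toy[, runlen := .I]     # running row number after sorting

# Self-join on id: for every row, pair its motif with each motif
# that sits earlier (smaller runlen) within the same id
pairs <- toy[toy, {
  tmp <- runlen < i.runlen
  list(motif[tmp], i.motif[any(tmp)])
}, by = .EACHI][, .N, by = "V1,V2"]

print(pairs)
```

In this sketch each row of pairs is one distinct (V1, V2) combination, and N is the number of ids in which both motifs appear together: a-b is counted in two ids, while a-c and b-c are each counted in one.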
Running this works fine provided you have enough memory on your computer. I humbly admit I have no idea what exactly the code does to create the wanted results, so I'm only looking at input and output, agnostic of the process. When I apply the same code to my data, pairs seem to appear that are not in the original data.
I'm running the following code, which is a slightly adapted version of what @Arun provided. The adaptation is needed because I have to run the code for 17 different blocks, i.e. I want to know which pairs occur within each specific block.
cooc <- data.frame()
for (j in 1:17) {
  DT <- dt[block == j, c("pid", "cid"), with = FALSE]
  DT$pid <- as.factor(DT$pid)
  setorder(DT)
  setkey(DT, pid)
  DT[, runlen := .I]
  ans <- DT[DT, {
    tmp = runlen < i.runlen;
    list(cid[tmp], i.cid[any(tmp)])
  },
  by = .EACHI][, .N, by = "V1,V2"]
  ans$block <- j
  cooc <- data.table(rbind(cooc, ans))
  rm(ans)
}
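For comparison, the per-block loop could also be sketched with a small helper and rbindlist() instead of growing cooc inside the loop. This is only a sketch under assumptions: count_pairs is a hypothetical helper name, the toy dt below is made up, and the helper keeps the unique() step that the original answer marks as important.

```r
library(data.table)

# Hypothetical helper: the pre-processing and join steps for one block
count_pairs <- function(d) {
  d <- unique(d[, .(pid, cid)])   # drop duplicate cid within the same pid
  setorder(d)                     # sort by pid, then cid
  setkey(d, pid)
  d[, runlen := .I]
  d[d, {
    tmp <- runlen < i.runlen
    list(cid[tmp], i.cid[any(tmp)])
  }, by = .EACHI][, .N, by = "V1,V2"]
}

# Hypothetical toy input with two blocks (block 1 contains a duplicate row)
dt <- data.table(block = c(1L, 1L, 1L, 2L, 2L),
                 pid   = c(1L, 1L, 1L, 1L, 1L),
                 cid   = c(39L, 41L, 41L, 39L, 42L))

# One result table per block, stacked at the end
cooc <- rbindlist(lapply(sort(unique(dt$block)), function(j) {
  ans <- count_pairs(dt[block == j])
  ans[, block := j]
  ans
}))

print(cooc)
```

Stacking the per-block tables once with rbindlist() avoids repeatedly copying cooc. Whether the unique() call changes anything depends on whether a (pid, cid) combination can repeat within a block.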
As far as I understand the code, it's all identical, just wrapped in a for loop to do the same thing for 17 blocks. Both pid and cid are just integers that identify a variable of interest.
For j = 1, I get the following:
DT[cid == 39] # cid is my equivalent of motif above and pid is my equivalent of id above
  pid cid runlen
20319  39   3614
This suggests there is only one pid for which cid equals 39.
Now when I look into the resulting ans data.table I get the following:
ans[V1 == 39]
V1 V2 N block
1: 39 41 1 1
2: 39 42 1 1
3: 39 44 1 1
4: 39 47 1 1
5: 39 7027 1 1
6: 39 7043 1 1
7: 39 7174 1 1
8: 39 9434 1 1
9: 39 11493 1 1
10: 39 18815 1 1
11: 39 18875 1 1
12: 39 18896 1 1
13: 39 18909 1 1
14: 39 18924 1 1
15: 39 18928 1 1
16: 39 18929 1 1
17: 39 18931 1 1
18: 39 18932 1 1
19: 39 19265 1 1
20: 39 19410 1 1
Suddenly, there are 20 rows with V1 == 39 (if I understand the code correctly, V1 is the equivalent of what was cid). Yet in DT there is only one pid assigned to cid 39.
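One check that might narrow this down is to list every other cid that shares a pid with cid 39 and compare that list with the V2 column of ans[V1 == 39]. The sketch below uses a made-up stand-in for DT (the extra pid 555 and the cid values 41 and 42 are assumptions for illustration); pids_with_39 and partners are hypothetical names.

```r
library(data.table)

# Hypothetical stand-in for one block: pid 20319 carries cid 39, 41 and 42
DT <- data.table(pid = c(20319L, 20319L, 20319L, 555L),
                 cid = c(39L, 41L, 42L, 41L))

# Which pid(s) carry cid 39?
pids_with_39 <- DT[cid == 39, unique(pid)]

# Every other cid that shares a pid with cid 39
partners <- DT[pid %in% pids_with_39 & cid != 39, sort(unique(cid))]

print(pids_with_39)
print(partners)
```

If the partners found this way on the real data match the V2 values listed above, the 20 rows might be 20 distinct pairs involving cid 39 within a single pid rather than 20 separate occurrences of cid 39, but that is only a guess until checked.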
I have no idea how to make this reproducible, so I have tried to show what seems to be inconsistent. I don't think the original code has this problem, so I'm hoping someone can explain, based on the info given here, where the additional occurrences of cid == 39 come from.