optimizing code in R for vector comparisons in data.table

Question

As part of my program in R, I have to compare a huge number of pair of sentences with some functions (the one im showing here is comparing sentences with the same number of words, and whether there is just exactly one different word between those two sentences)

To make things faster, I have already converted all words into integers so I am dealing with integer vectors so the example function is a very simple one

is_sub_num <- function(a,b){sum(!(a==b))==1}

where a,b are character vectors such as

a = c(1,2,3); b=c(1,4,3) 
is_sub_num(a,b)
# [1] TRUE

my data will be stored in a data.table

Classes ‘data.table’ and 'data.frame':  100 obs. of  2 variables:
 $ ID: int  1 2 3 4 5 6 7 8 9 10 ...
 $ V2:List of 100
  ..$ : int  4 4 3 4
  ..$ : int  1 2 3 1

the length of each entry may be different (in the example below, the entries are all of size 4)

I have a table with candidate pair IDs to test the corresponding entries in DT with the function above as follow

is_pair_ok  <- function(pair){
            is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}

here is a simplification of what I'm trying to do:

set.seed=234
z = lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok  <- function(pair){
        is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}

pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))

DT <- as.data.table(1:100)
DT$V2 <- z
colnames(DT) <- c("ID","V2")

print(system.time(tmp <-apply(pair_list,1,is_pair_ok)))

this takes around 22 seconds on my laptop although its only 10,000 entries and the functions are very very basic.

Do you have any advice on how to speed up the code ???

This whole question is not reproducible and doesn't contain a desired output. I doubt anyone can help you with so little *useful* information- especially on Sunday. — David Arenburg, Nov 22 '15 at 10:51
[Please, read the info on how to give a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) — Jaap, Nov 22 '15 at 10:59
ok guys im going to work to try to make a reproducible example. its going to take a bit of while to work on it. thanks for your comments — Fagui Curtain, Nov 22 '15 at 12:21
@jogo ok thanks. im going to change my implementation without any apply and report on that — Fagui Curtain, Nov 22 '15 at 23:36
Please omit "EDITED" notices. If folks want to dig into the edit history for the post, they can do that without your declaration. — Frank, Nov 23 '15 at 19:33

Fagui Curtain · Accepted Answer · 2015-11-26T01:02:25.193

i have delved further myself into this issue, and here is my answer. I think its an important one, and everyone should know it so please vote for this post, it doesn't deserve its bad score !!

The code to the answer is below. I have put some new parameters to make the problem a bit more general. The key point is to use the unlist function. Whenever we use apply to a list object, we get very very bad performance in R. its a bit of a pain in the ass to explode objects and to do manual indexing in a vector, but the speedup is phenomenal.

set.seed=234
N=100
nobs=10000
z = lapply(1:N, function(x) sample(1:4,size=sample(3:5),replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok  <- function(pair){
        is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}

is_pair_ok1  <- function(pair){
        is_sub_num(zzz[pos_table[pair[1]]:(pos_table[pair[1]]+length_table[pair[1]] -1) ],
                   zzz[pos_table[pair[2]]:(pos_table[pair[2]]+length_table[pair[2]] -1) ]) }

pair_list <- as.data.table(cbind(sample(1:N,nobs,replace=TRUE),sample(1:N,nobs,replace=TRUE)))

DT <- as.data.table(1:N)
DT$V2 <- z
setnames(DT, c("ID","V2"))
setkey(DT, ID)

length_table <- sapply(z,length)
myfun <- function(i){sum(length_table[1:i])}
pos_table <- c(0,sapply(1:N,myfun))+1
zzz=unlist(z)

print(system.time(tmp_ref <- apply(pair_list,1,is_pair_ok)))
print(system.time(tmp <- apply(pair_list,1,is_pair_ok1)))
identical(tmp,tmp_ref)

here is the output

utilisateur     système      écoulé 
      20.96        0.00       20.96 
utilisateur     système      écoulé 
       0.70        0.00        0.71 
There were 50 or more warnings (use warnings() to see the first 50)
[1] TRUE

EDIT it would a bit too long to post here. I tried to draw conclusions from the above and modify the source code of my program by trying to speed it up and using unlist, and manual indexing. the new implementation actually is slower which came as a surprise to me, and i fail to understand why...

jogo · Answer 2 · 2015-11-23T19:36:21.860

now I have 60% spare of time:

library(data.table)
set.seed(234)

is_sub_num <- function(a,b) sum(!(a==b))==1
is_pair_ok2  <- function(p1, p2) is_sub_num(DT[p1,V2][[1]],DT[p2,V2][[1]]) 

DT <- as.data.table(1:100)
DT$V2 <- lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE)) 
setnames(DT, c("ID","V2"))
setkey(DT, ID)

pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
print(system.time(tmp <- mapply(FUN=is_pair_ok2, pair_list$V1, pair_list$V2)))

most effect had setting the key for DT and using fast indexing in is_pair_ok2()

a little bit more (without the function is_sub_num()):

is_pair_ok3  <- function(p1, p2) sum(DT[p1,V2][[1]]!=DT[p2,V2][[1]])==1
print(system.time(tmp <- mapply(FUN=is_pair_ok3, pair_list$V1, pair_list$V2)))

thanks for looking at it. i have found a good answer please look at it — Fagui Curtain, Nov 25 '15 at 06:19

optimizing code in R for vector comparisons in data.table

2 Answers2