-1

As part of my program in R, I have to compare a huge number of pair of sentences with some functions (the one im showing here is comparing sentences with the same number of words, and whether there is just exactly one different word between those two sentences)

To make things faster, I have already converted all words into integers so I am dealing with integer vectors so the example function is a very simple one

is_sub_num <- function(a,b){sum(!(a==b))==1}

where a,b are character vectors such as

a = c(1,2,3); b=c(1,4,3) 
is_sub_num(a,b)
# [1] TRUE

my data will be stored in a data.table

Classes ‘data.table’ and 'data.frame':  100 obs. of  2 variables:
 $ ID: int  1 2 3 4 5 6 7 8 9 10 ...
 $ V2:List of 100
  ..$ : int  4 4 3 4
  ..$ : int  1 2 3 1

the length of each entry may be different (in the example below, the entries are all of size 4)

I have a table with candidate pair IDs to test the corresponding entries in DT with the function above as follow

is_pair_ok  <- function(pair){
            is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}

here is a simplification of what I'm trying to do:

set.seed=234
z = lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok  <- function(pair){
        is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}

pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))

DT <- as.data.table(1:100)
DT$V2 <- z
colnames(DT) <- c("ID","V2")

print(system.time(tmp <-apply(pair_list,1,is_pair_ok)))

this takes around 22 seconds on my laptop although its only 10,000 entries and the functions are very very basic.

Do you have any advice on how to speed up the code ???

jogo
  • 12,469
  • 11
  • 37
  • 42
Fagui Curtain
  • 1,867
  • 2
  • 19
  • 34

2 Answers2

1

i have delved further myself into this issue, and here is my answer. I think its an important one, and everyone should know it so please vote for this post, it doesn't deserve its bad score !!

The code to the answer is below. I have put some new parameters to make the problem a bit more general. The key point is to use the unlist function. Whenever we use apply to a list object, we get very very bad performance in R. its a bit of a pain in the ass to explode objects and to do manual indexing in a vector, but the speedup is phenomenal.

set.seed=234
N=100
nobs=10000
z = lapply(1:N, function(x) sample(1:4,size=sample(3:5),replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok  <- function(pair){
        is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}

is_pair_ok1  <- function(pair){
        is_sub_num(zzz[pos_table[pair[1]]:(pos_table[pair[1]]+length_table[pair[1]] -1) ],
                   zzz[pos_table[pair[2]]:(pos_table[pair[2]]+length_table[pair[2]] -1) ]) }

pair_list <- as.data.table(cbind(sample(1:N,nobs,replace=TRUE),sample(1:N,nobs,replace=TRUE)))

DT <- as.data.table(1:N)
DT$V2 <- z
setnames(DT, c("ID","V2"))
setkey(DT, ID)

length_table <- sapply(z,length)
myfun <- function(i){sum(length_table[1:i])}
pos_table <- c(0,sapply(1:N,myfun))+1
zzz=unlist(z)

print(system.time(tmp_ref <- apply(pair_list,1,is_pair_ok)))
print(system.time(tmp <- apply(pair_list,1,is_pair_ok1)))
identical(tmp,tmp_ref)

here is the output

utilisateur     système      écoulé 
      20.96        0.00       20.96 
utilisateur     système      écoulé 
       0.70        0.00        0.71 
There were 50 or more warnings (use warnings() to see the first 50)
[1] TRUE

EDIT it would a bit too long to post here. I tried to draw conclusions from the above and modify the source code of my program by trying to speed it up and using unlist, and manual indexing. the new implementation actually is slower which came as a surprise to me, and i fail to understand why...

Fagui Curtain
  • 1,867
  • 2
  • 19
  • 34
0

now I have 60% spare of time:

library(data.table)
set.seed(234)

is_sub_num <- function(a,b) sum(!(a==b))==1
is_pair_ok2  <- function(p1, p2) is_sub_num(DT[p1,V2][[1]],DT[p2,V2][[1]]) 

DT <- as.data.table(1:100)
DT$V2 <- lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE)) 
setnames(DT, c("ID","V2"))
setkey(DT, ID)

pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
print(system.time(tmp <- mapply(FUN=is_pair_ok2, pair_list$V1, pair_list$V2)))

most effect had setting the key for DT and using fast indexing in is_pair_ok2()

a little bit more (without the function is_sub_num()):

is_pair_ok3  <- function(p1, p2) sum(DT[p1,V2][[1]]!=DT[p2,V2][[1]])==1
print(system.time(tmp <- mapply(FUN=is_pair_ok3, pair_list$V1, pair_list$V2)))
jogo
  • 12,469
  • 11
  • 37
  • 42