As part of my R program, I have to compare a huge number of sentence pairs with several functions. The one I'm showing here takes two sentences with the same number of words and checks whether exactly one word differs between them.
To make things faster, I have already converted all words to integers, so I am dealing with integer vectors and the example function is a very simple one:
is_sub_num <- function(a,b){sum(!(a==b))==1}
where a and b are vectors of integer word codes such as
a = c(1,2,3); b=c(1,4,3)
is_sub_num(a,b)
# [1] TRUE
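Since the comparison is only meant for sentences with the same number of words, here is a minimal sketch of a variant that checks the lengths first. The name is_sub_num_safe is hypothetical and not part of my actual code; it only illustrates the intended behaviour when the lengths differ (the plain a==b would recycle the shorter vector instead).

```r
# Hypothetical variant of is_sub_num that also requires equal lengths.
is_sub_num_safe <- function(a, b) {
  # Sentences of different lengths cannot differ by exactly one substitution,
  # so the && short-circuits before the element-wise comparison runs.
  length(a) == length(b) && sum(a != b) == 1
}

is_sub_num_safe(c(1, 2, 3), c(1, 4, 3))     # exactly one position differs
is_sub_num_safe(c(1, 2, 3), c(1, 2, 3))     # identical vectors
is_sub_num_safe(c(1, 2, 3), c(1, 4, 3, 5))  # lengths differ
```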
My data will be stored in a data.table:
Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
$ ID: int 1 2 3 4 5 6 7 8 9 10 ...
$ V2:List of 100
..$ : int 4 4 3 4
..$ : int 1 2 3 1
The length of each entry may be different (in the example below, the entries are all of size 4).
I have a table of candidate ID pairs, and I test the corresponding entries in DT with the function above as follows:
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
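To make the lookup concrete, here is a small sketch on a toy table. The three rows are made up for illustration and are not my real data; each call filters DT twice by ID and pulls the stored vector out of the list column with [[1]].

```r
library(data.table)

# Toy table (made-up values): three IDs, each with an integer vector
# stored in the list column V2.
DT <- data.table(ID = 1:3)
DT$V2 <- list(c(4L, 4L, 3L, 4L), c(4L, 1L, 3L, 4L), c(1L, 2L, 3L, 1L))

is_sub_num <- function(a, b) sum(a != b) == 1

is_pair_ok <- function(pair) {
  # Each DT[ID == ..., V2] returns a one-element list; [[1]] extracts the vector.
  is_sub_num(DT[ID == pair[1], V2][[1]], DT[ID == pair[2], V2][[1]])
}

is_pair_ok(c(1, 2))  # rows 1 and 2 differ only in the second position
is_pair_ok(c(1, 3))  # rows 1 and 3 differ in three positions
```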
Here is a simplified version of what I'm trying to do:
set.seed(234)
z = lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
DT <- as.data.table(1:100)
DT$V2 <- z
colnames(DT) <- c("ID","V2")
print(system.time(tmp <-apply(pair_list,1,is_pair_ok)))
This takes around 22 seconds on my laptop, although there are only 10,000 pairs and the functions are very basic.
Do you have any advice on how to speed up the code?