I have created a function that checks whether two integers from two different files match. A match is based on the longest shared prefix: the value whose leading digits agree with the beginning of the lookup value for the greatest number of digits wins. For example:
lookup value: 12345678  # a value from the text_clean table, which contains 19 million values
the values from the cost_table:
12
123
124
125
1234
1235
this gives a match with 1234, the longest prefix of 12345678 that appears in the cost_table (1235 matches only the first three digits).
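To make the matching logic concrete, here is a standalone version of it on the example above (the names lookup and cost_codes exist only for this snippet; in the real code I use fmatch from the fastmatch package as a faster drop-in for match):

lookup <- 12345678
cost_codes <- c(12, 123, 124, 125, 1234, 1235)

# All prefixes of the lookup value, from longest to shortest:
# 12345678, 1234567, ..., 1
prefixes <- floor(lookup * 10^(-(0:(nchar(lookup) - 1))))

# Position of each cost code in the prefix list (NA if it is not a prefix);
# the code with the smallest position is the longest prefix match
pos <- match(cost_codes, prefixes)
cost_codes[which.min(pos)]  # 1234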
The problem is that I have 19 million values to look up and match against a list of 300,000 values. Both tables are stored in separate data frames. I'm currently using the apply function to perform this matching on every single row. Processing 1,000 values takes around 2 minutes, which extrapolates to roughly 26 days for the full 19 million rows. Is there a faster way of doing this?
I thought about using lapply in combination with the multicore package (now absorbed into the parallel package), but I cannot figure out how to get that working.
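Roughly what I had in mind is the following (only a sketch: mclapply comes from the parallel package, which replaced multicore, and I had to guess a column name value for the lookup numbers; mclapply forks, so it runs on Linux/macOS but not Windows):

library(parallel)
library(fastmatch)

# Per-value matching logic, the same as in the function further below
match_one <- function(i) {
  if (is.na(i)) return(NA)
  lookupVec <- floor(i * (10^(-(0:(nchar(i) - 1)))))
  cost$CODE[which.min(fmatch(cost$CODE, lookupVec))]
}

# Spread the 19 million lookups over all cores but one
codes <- mclapply(textclean$value, match_one, mc.cores = detectCores() - 1)

What I am unsure about is the output: appending to the CSV from inside parallel workers looks unsafe, so I would presumably collect the results and write the file once at the end instead of calling cat() 19 million times.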
For the single-core version, this is the function I'm currently using:
library(fastmatch)  # fmatch() is a faster drop-in for match()

counter <- 0  # global row counter, advanced with <<- inside the function

apply(textclean[1:1000, ], 1, function(i) {
  counter <<- counter + 1
  if (is.na(i)) {
    return(NULL)  # `next` is only valid inside a loop, not inside a function
  }
  # Build all prefixes of the lookup value, longest first:
  # e.g. 12345678, 1234567, ..., 1
  lookupVec <- floor(i * (10^(-(0:(nchar(i) - 1)))))
  duration <- text$duration[counter]
  number <- text$number[counter]
  # The cost code with the smallest position in lookupVec is the longest prefix match
  code <- cost$CODE[which.min(fmatch(cost$CODE, lookupVec))]
  # Append one line of output per row to a CSV file
  cat(paste(code, duration, number, sep = ","),
      file = ".../outputfile.csv", append = TRUE, fill = TRUE)
})
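For reference, a toy version of the three tables referenced in the code (structure guessed from the code itself, with the lookup numbers in a single column I've named value; the real tables have 19 million and 300,000 rows):

textclean <- data.frame(value = c(12345678, NA, 12549))
text <- data.frame(duration = c(60, 30, 45),
                   number = c("A1", "B2", "C3"))
cost <- data.frame(CODE = c(12, 123, 124, 125, 1234, 1235))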