
I have created a function that checks whether two integers from two different files match. The match is based on the longest common prefix of the numbers.

lookup value: 12345678   #this is a value from the text_clean table, which contains 19 million values.

the values from the cost_table:
12
123
124
125
1234
1235

This gives a match with 1234, the longest code that is a prefix of the lookup value.
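Spelled out in a few lines of base R (`cost_codes` stands in for the CODE column of the cost_table):

```r
# The matching rule: take every numeric prefix of the lookup value and
# keep the longest one that also occurs among the cost codes.
lookup     <- 12345678
cost_codes <- c(12, 123, 124, 125, 1234, 1235)

prefixes <- floor(lookup / 10^(0:(nchar(lookup) - 1)))  # 12345678, 1234567, ..., 1
hit <- prefixes[prefixes %in% cost_codes][1]            # prefixes run longest-first
hit
#> [1] 1234
```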

The problem is that I have 19 million values to look up and match against a list of 300,000 values. I stored the two tables in separate data frames.

I'm now using the apply function to perform this function on every single row. 1,000 values take around 2 minutes, which means the full data would take several days to finish. Is there a faster way of doing this?

I thought maybe of using lapply in combination with the multicore package, but I cannot figure out how to get that working.
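For the parallel route: the multicore package was merged into base R's parallel package (R >= 2.14), and its `mclapply()` is the drop-in for `lapply()`. A minimal sketch over row indices, with toy stand-ins for the real tables (the column names are assumptions, and base `match()` replaces `fmatch()` so the example has no package dependency):

```r
library(parallel)  # successor of the multicore package; mclapply forks on Unix

# Toy stand-ins for the question's tables (names and columns are assumptions):
textclean <- data.frame(value = c(12345678, NA, 1251))
text      <- data.frame(duration = c(10, 20, 30), number = c(111, 222, 333))
cost      <- data.frame(CODE = c(12, 123, 124, 125, 1234, 1235))

match_one <- function(k) {
  i <- textclean$value[k]
  if (is.na(i)) return(NULL)
  lookupVec <- floor(i * 10^(-(0:(nchar(i) - 1))))           # prefixes, longest first
  code <- cost$CODE[which.min(match(cost$CODE, lookupVec))]  # longest prefix in cost
  data.frame(code = code, duration = text$duration[k], number = text$number[k])
}

# mc.cores is illustrative; on Windows mclapply only supports mc.cores = 1
res <- mclapply(seq_len(nrow(textclean)), match_one, mc.cores = 2)
out <- do.call(rbind, res)  # collect in memory, write to disk once at the end
```

Collecting the results and writing the file once at the end also removes the per-row `cat(..., append = TRUE)` cost, which is a large part of the runtime on its own.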

This is the function I'm using.

library(fastmatch)  # provides fmatch()
counter <- 0        # row counter, advanced inside the function via <<-

apply(textclean[1:1000, ], 1, function(i){

  counter <<- counter + 1

  if (is.na(i)) {
    return(NULL)  # `next` only works inside loops; in a function body it is an error
  }

  lookupVec <- floor(i * (10 ^ (-1 * (0:(nchar(i) - 1)))))  #construct a vector of all possible matches for the lookup value, longest first
  duration <- text$duration[counter]
  number <- text$number[counter]
  code <- cost$CODE[which.min(fmatch(cost$CODE, lookupVec))]  #matches the longest possible value from the textclean_table with the value from the cost_table

  cat(paste(code, duration, number, sep = ","),
      file = ".../outputfile.csv", append = TRUE, fill = TRUE)  #the output is appended to a csv file row by row
  }
)
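Along the lines of the comments' advice, the row loop can also be removed entirely: expand every lookup value into all of its prefixes at once, match them against the cost codes in a single call, and keep the longest hit per value. A vectorized sketch using toy stand-ins for the real tables (the `value`/`CODE`/`duration`/`number` column names are assumptions taken from the code above):

```r
# Toy stand-ins for the question's 19M-row and 300K-row tables:
textclean <- data.frame(value = c(12345678, NA, 1251))
text      <- data.frame(duration = c(10, 20, 30), number = c(111, 222, 333))
cost      <- data.frame(CODE = c(12, 123, 124, 125, 1234, 1235))

vals <- textclean$value
keep <- !is.na(vals)
v    <- vals[keep]
n    <- nchar(v)

idx  <- rep(seq_along(v), n)                     # maps each prefix back to its value
pref <- floor(rep(v, n) / 10^(sequence(n) - 1))  # all prefixes, longest first per value

hit   <- pref %in% cost$CODE
first <- !duplicated(idx[hit])                   # first (= longest) match per value

code <- rep(NA_real_, length(v))
code[idx[hit][first]] <- pref[hit][first]

out <- data.frame(code = code,
                  duration = text$duration[keep],
                  number = text$number[keep])
# write.csv(out, "outputfile.csv", row.names = FALSE)  # one write at the end
```

This does one `%in%` over roughly `sum(nchar(vals))` prefixes instead of 19 million separate lookups, and it only touches the disk once.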
Jelmer
  • Your function takes time, not the looping mechanism (apply, lapply, etc.). Parallel processing *can* speed things up, but it will be limited by the number of cores - if you have 4 cores the absolute most you will get is 4x speed (it will not quite be that much), so several days will be reduced to maybe 1 day - still not great. I think you need to optimize your function instead. – Gregor Thomas Nov 28 '17 at 15:52
  • 3
    At a glance, move as much *outside* the loop as possible. Can `lookupVec` be pre-defined, rather than computed each time? You seem to be using data frames, can you use a matrix instead? Opening a file connection and writing to disk is slow, can you accumulate the results in R and just write to file at the end? If you post sample input and output and describe a little more what you are doing we can try to help. – Gregor Thomas Nov 28 '17 at 15:55
  • 1
    I agree fully with Gregor's initial steps. But while I'm trying to reproduce it for some small benchmarking, it would help immensely if this were fully [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). For example, what packages are you using? (`fmatch` is not found). Can you give a small yet sufficient sample of `textclean`, `text`, and `cost`? (Have I missed one?) One test is being able to run your question-code completely in a new R session without sourcing any of your original code. – r2evans Nov 28 '17 at 16:03
  • You should say that your question is a follow-up of [this question](https://stackoverflow.com/questions/47517561/find-the-longest-match-of-2-integers-in-r). – Rui Barradas Nov 28 '17 at 16:17
  • @RuiBarradas Yes thanks – Jelmer Nov 28 '17 at 16:30

0 Answers