I have a for loop that compares two address columns to build a third column. I am having a hard time converting this loop into an apply-family call that also takes arguments.
Code that works:

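library(fuzzywuzzyR)  # provides SequenceMatcher

# Compare the two address columns row by row and store the similarity ratio
for (i in seq_along(df_test$col1)) {
  print(i)
  df_test$flag[i] <- SequenceMatcher$new(tolower(df_test$address[i]), tolower(df_test$address2[i]))$ratio()
}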

NOTE: SequenceMatcher comes from the fuzzywuzzyR package, so you don't need to worry about it. I just want to convert this to apply or something in the same family, because the for loop is very slow on big datasets.

sample:

col1   address  address2    flag
1      abced     abcd ad    0
2      def        def       1
3      abcdef     abcdef    1
4      xqc        abc       0
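
For anyone who wants to test against the sample, it could be rebuilt as a data frame along these lines. This is only a sketch: the name `df_test` is borrowed from the loop above, and the `flag` values are copied from the table as the desired output, not computed.

```r
# Sample data from the table above; flag holds the desired output as given,
# it is not calculated here.
df_test <- data.frame(
  col1     = 1:4,
  address  = c("abced", "def", "abcdef", "xqc"),
  address2 = c("abcd ad", "def", "abcdef", "abc"),
  flag     = c(0, 1, 1, 0),
  stringsAsFactors = FALSE  # keep the addresses as character, not factor
)

str(df_test)
```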

Function I tried:

seqM2 <- function(x,table,flag,one,two) {
  for (i in 1:length(table$one)){
    return(SequenceMatcher$new(tolower(table$one[i]),tolower(table$two[i]))$ratio())
  }
}

where
table = data frame
flag = new column
one = first address column
two = second address column

How do I pass this to mapply?

Rishi
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Sep 10 '18 at 18:32
  • Sorry about that; edited. – Rishi Sep 10 '18 at 18:35
  • `apply` is for matrices. You've got a data frame, not a matrix. `lapply` and `sapply` are good for applying a function to each data frame column, but here you have a function that takes multiple columns at once. Your for loop is fine, but if you want an `*apply` family function, `mapply` (or `Map`) is what you're looking for. [See here for details](https://stackoverflow.com/a/7141669/903061). – Gregor Thomas Sep 10 '18 at 18:38
  • The main problem I am encountering is passing the arguments. I will edit the question and include the function I made. – Rishi Sep 10 '18 at 18:40
  • Also, based on your edits, efficiency is your problem. Most likely you'll be disappointed by the gains in switching from `for` to apply. [See here, for example](https://stackoverflow.com/q/2275896/903061). You will get extremely small gains at best, and with newer versions of R having a JIT compiler, possibly a slower result overall. (Though even then, only marginally.) You'd probably do much better to switch your data frame to a `data.table`. – Gregor Thomas Sep 10 '18 at 18:45
  • Is there any way I can make the for loop faster? I read about using require(compiler) and enableJIT(3), but I don't think it made the code faster for me. There are more than a million records, and the loop has been running for the past hour. – Rishi Sep 10 '18 at 18:49
  • Looping in itself doesn't take much time; it's what you do inside the loop that takes time. You'll save a little bit of time if you call `tolower` on the whole vector before the loop. But whatever `SequenceMatcher$new` does is what is taking the real time. You can run in parallel, using the `foreach` package, for example. Or you could try to optimize the `SequenceMatcher$new` function: if you only need the `$ratio` part of the output, perhaps parts of it could be cut? – Gregor Thomas Sep 10 '18 at 19:12
  • SequenceMatcher is just a function like levenshteinSim in the RecordLinkage package, but it's better at comparing addresses, so I will look for an alternative. I will also do the preprocessing first, converting the columns to lower case before applying the function, and then try the for loop again. Thanks for helping me out. – Rishi Sep 10 '18 at 19:18
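
The two suggestions in the comments above, `mapply` over both columns and calling `tolower` once on whole vectors, can be sketched together as follows. To keep the sketch runnable without fuzzywuzzyR or its Python backend, it uses a hypothetical stand-in named `similarity`, built on base R's `adist()`. That stand-in is an assumption for illustration only and does not give the same numbers as `SequenceMatcher$new(a, b)$ratio()`. As the comments note, this mainly tidies the code and is unlikely to be much faster than the loop.

```r
# Hypothetical stand-in for SequenceMatcher$new(a, b)$ratio(): a normalised
# Levenshtein similarity from base R's adist(). Not the same metric.
similarity <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)          # two empty strings count as identical
  1 - as.numeric(adist(a, b)) / longest
}

address  <- c("abced", "def", "ABCDEF", "xqc")
address2 <- c("abcd ad", "def", "abcdef", "abc")

# Lower-case each vector once, instead of once per element inside a loop.
a <- tolower(address)
b <- tolower(address2)

# mapply() steps through a and b together, passing one pair per call.
# USE.NAMES = FALSE stops it from naming the result after the strings in a.
flag <- mapply(similarity, a, b, USE.NAMES = FALSE)

print(flag)
```

With a data frame, the same call would be assigned to the new column, for example `df_test$flag <- mapply(similarity, tolower(df_test$address), tolower(df_test$address2), USE.NAMES = FALSE)`. To use the real matcher, the body of `similarity` could presumably be replaced with `SequenceMatcher$new(a, b)$ratio()`.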
