
I'm trying to write a loop to perform the following actions on a data frame:

For every name in the 'Name' column, check to see if a rough match (accomplished with agrep() ) exists in the 'Referral' column. If a match exists, replace all cells in the 'Referral' column that roughly match the name with 'referral'.

Here is my code so far:

for (i in 1:1000){
 for (q in 1:length(agrep(c$Name[i], c$Referal))){
  if (length(agrep(c$Name[i], c$Referal)>0)){
   c$Referal[agrep(c$Name[i], c$Referal)[q]]<-'panda'
  }
 }
}

However, this code (after taking 20 minutes to run) replaces ALL cells in the 'Referral' column with the replacement value. I'm wondering if the 'i' in the first row stays the same throughout the whole loop? It's obviously a clunky chunk of code, but I can't think of why it would do this...

An example would be:

Name <- c('michael jordan', 'carrot', 'ginger')
Referral <- c('internet', 'facebook', 'mike jordan')
df <- data.frame(Name, Referral)

After running the function, ideally df$Referral[3]=='referral' would be TRUE.

Will
  • Please share a reproducible example of your data frame and your desired output. It is possible that you can use other approaches to replace the `for loop` operation. But we need to know your data to be able to help. – www Jul 18 '17 at 14:50
  • You can refer to this link to learn how to make a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. – www Jul 18 '17 at 14:51
  • Thanks— I'll work on getting a portion of the dataset up that reproduces the error – Will Jul 18 '17 at 15:01

2 Answers


Have you tried using the value=TRUE option?

As is, your code gets back the indices of the matching elements from agrep(), not the matches themselves. Using value=TRUE returns the matching strings instead.

Edit: here's some code to help you understand what agrep() is returning and how it works within your loop.

Name <- c('michael jordan', 'carrot', 'ginger')
Referral <- c('internet', 'facebook', 'mike jordan')
df <- data.frame(Name, Referral)    

agrep(df$Name[1], df$Referral, max.distance = 5, value = TRUE)

# [1] "mike jordan"

length(agrep(df$Name[1], df$Referral, max.distance = 5, value = TRUE))

# [1] 1

agrep(df$Name[1], df$Referral, max.distance = 0.5)

# [1] 3

df$Referral[agrep(df$Name[1], df$Referral)[1]]

# [1] NA   (the default max.distance finds no match, so the index is integer(0)[1], i.e. NA)

for (i in 1:3){
  for (q in 1:length(agrep(df$Name[i], df$Referral))){
    if (length(agrep(df$Name[i], df$Referral)>0)){
      df$Referral[agrep(df$Name[i], df$Referral)[q]] <-'panda'
    }
  }
}

To make the importance of the max.distance argument clearer, here are three examples. Keep in mind that integer(0) is a zero-length vector in R, which here means no match was found.

> agrep(df$Name[1], df$Referral) 
integer(0) 
> agrep(df$Name[1], df$Referral, max.distance = 0.2) 
integer(0) 
> agrep(df$Name[1], df$Referral, max.distance = 0.5) 
[1] 3 
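
Putting those pieces together, here is a rough sketch of how the nested loop could be reduced to a single loop over names. It assumes that max.distance = 0.5 is an acceptable threshold for your data and that both columns are character vectors rather than factors; `max_dist`, `hits` and `result` are just illustrative names.

```r
# Example data from the question, stored as character vectors
Name <- c('michael jordan', 'carrot', 'ginger')
Referral <- c('internet', 'facebook', 'mike jordan')

# Assumed threshold: allow edits up to 50% of each name's length
max_dist <- 0.5

# Search the original values so earlier replacements cannot affect later matches
result <- Referral
for (nm in Name) {
  # Indices of every Referral entry that roughly matches this name (may be empty)
  hits <- agrep(nm, Referral, max.distance = max_dist)
  # Assigning through a zero-length index changes nothing, so no if() is needed
  result[hits] <- 'referral'
}

result[3]
```

A threshold this loose can also flag unrelated entries (for example, 'ginger' is only two edits away from the 'inter' in 'internet'), so it is worth inspecting `result` on a sample before running it on all rows.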
zfisher
  • Where would you insert the value=TRUE argument? In all lines? – Will Jul 18 '17 at 15:01
  • Oh I see what you're saying. Where you're suggesting I return a value I currently have an index— my solution right now is to reference the value via the index in the second line using 'q'. That's definitely an efficiency improvement, thanks. I still wonder why the longer way wouldn't work! – Will Jul 18 '17 at 15:12
  • You also haven't set a `max.distance`. `agrep()` uses the generalized Levenshtein distance, i.e. the minimum number of insertions, deletions and substitutions needed to turn one string into the other. The default is `max.distance = 0.1`, which means that anything more than 10% of the pattern length away will not be matched. – zfisher Jul 18 '17 at 16:10
  • Thanks for breaking that down. Could you explain why `agrep(df$Name[1], df$Referral)` fails to see the similarity between Name[1] and Referral[3], but succeeds in finding the same similarity when you use the whole column of Names as the search pattern (e.g. `agrep(df$Name, df$Referral, max = 5, value = TRUE)`)? – Will Jul 18 '17 at 17:37

Not sure agrep will work for you in the way you expect:

agrep('michael jordan', 'mike jordan')
integer(0)
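
To see why the default rejects this pair, it can help to count the edits. The sketch below uses adist() from the utils package as a stand-in for the distance agrep() works with. Note that adist() compares whole strings, while agrep() looks for the pattern inside the target, so the two agree for this pair but need not in general. The `budget_*` names are only for illustration.

```r
pattern <- 'michael jordan'
target  <- 'mike jordan'

# Generalized Levenshtein distance between the two whole strings
edits <- drop(utils::adist(pattern, target))

# agrep() turns a fractional max.distance into an edit budget:
# fraction * pattern length, rounded up (the default fraction is 0.1)
budget_default <- ceiling(0.1 * nchar(pattern))
budget_loose   <- ceiling(0.5 * nchar(pattern))

edits > budget_default   # TRUE: too far apart for the default threshold
edits <= budget_loose    # TRUE: within the looser max.distance = 0.5 budget
```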

So I changed your data a bit. mike jordan is now in both vectors:

Name <- c('mike jordan', 'carrot', 'ginger')
Referral <-c('internet', 'facebook', 'mike jordan')

I also changed the logical condition to Referral[x] %in% Name. Then you can do

library(tidyverse)
newReferral <- map_chr(Referral, ~ifelse(.x %in% Name,'referral', .x))

[1] "internet" "facebook" "referral"
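
If you would rather not load the tidyverse for this, the same exact-match replacement can probably be written in base R, since %in% and ifelse() are already vectorized. A minimal sketch using the modified data above (this is still exact matching, so 'michael jordan' would not match 'mike jordan'):

```r
Name <- c('mike jordan', 'carrot', 'ginger')
Referral <- c('internet', 'facebook', 'mike jordan')

# %in% tests every element of Referral at once, so no explicit loop
# (or purrr::map_chr) is required
newReferral <- ifelse(Referral %in% Name, 'referral', Referral)
newReferral
```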
CPak
  • Thank you!! That tidyverse function is incredible. When my original code was successful, it took about 1 minute for every 500 rows in the dataframe. This operates on all 11,000 rows in about 4 seconds. Thank you so much. – Will Jul 18 '17 at 17:33
  • Glad to help. You can close this question by accepting an answer. If a better answer comes along after, you can re-accept a new answer as well. – CPak Jul 18 '17 at 18:18