
I am trying to identify observations that match between two datasets, using the text string vectors `$contractor` and `$employer`, and to create a TRUE/FALSE indicator of whether each contractor is in the employer list.

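library(caTools)
# Note: the name `list` shadows base::list(); it is kept here to match the text.
list <- data.frame(ID = 1:6,
     employer = c("a.c. construction", "abc concrete company", "xyz pool construction inc", "frank studebager llc", "annoying contractors llc", "beaumont ditch digging co inc"))
# `value` originally had four entries for five contractors, which makes data.frame() fail;
# the fifth entry is set to NA so the example runs.
jobs <- data.frame(contractor = c("a-c construction", "hank hill construction", "xyz pool const incorporated", "frank studebaer co", "hank hill const"),
     value = c(400000, 284590, 410280, 310980, NA))
jobs$match <- pmatch(jobs$contractor, list$employer, duplicates.ok = TRUE)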

`pmatch` finds no matches (it returns NA for every row), but this is because the company names are sloppily entered and not spelled consistently; there are obviously matches. I have also used the fuzzy matching command `agrepl`, but in my actual data the number and quality of matches vary greatly with small changes to the accepted Levenshtein distance.

There are also some answers here and here but my lack of advanced programming experience has kept me from applying the concepts there. Any thoughts are appreciated!

Eric
  • Welcome to SO. Take a look at: https://stackoverflow.com/help/how-to-ask so we can better assist you! – Adam Warner Jul 23 '18 at 20:32
  • Please add a sample dataset – M.Punt Jul 23 '18 at 20:42
  • See [how to make a great reproducible example](https://stackoverflow.com/a/5963610/2359523) to aid in answering your question. The warning you see is because the pattern `grep` is expecting is a string, not a vector. So it uses only the first element of `A$Contractor` as the pattern to match. See if [this question provides you a solution](https://stackoverflow.com/questions/34951410/partial-string-match-two-columns-r). – Anonymous coward Jul 23 '18 at 20:46
  • To avoid the warning you should enclose your command in a loop (for), analyzing each element in one of the vectors at each step of the loop. Also, to reduce the problem with punctuation marks, you could remove all of them, and maybe even remove the spaces, before comparison. This should improve the performance. Try tuning the `max.distance` parameter as well (0.4 seems too high to me). – Rodrigo Jul 23 '18 at 21:15
  • Thanks for your comment Rodrigo. As for `agrepl`, I have tested many values of the `max.distance` parameter, and the command seems to be extremely sensitive to that choice, even to the point of suggesting very bad matches (while ignoring seemingly objectively better ones). So the loop would analyze each character? It would seem the `max.distance` parameter would require that, or the Levenshtein distance would simply exist or not, hence a range would not be possible (if it were only analyzing the first character in each vector). – Eric Jul 26 '18 at 11:47
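
The suggestions in the comments above (strip punctuation and spaces, then compare one contractor at a time) could be sketched roughly as follows, using the sample names from the question. This is only an illustration under assumptions, not a tested solution for the real data: `normalize_name` is a hypothetical helper, and the `max.distance = 0.2` and `0.3` cutoffs are arbitrary values that would need tuning. The second half swaps in base R's `adist()` to pick the single closest employer for each contractor, which avoids depending on one `agrepl` threshold alone.

```r
# Sample names copied from the question
employer <- c("a.c. construction", "abc concrete company",
              "xyz pool construction inc", "frank studebager llc",
              "annoying contractors llc", "beaumont ditch digging co inc")
contractor <- c("a-c construction", "hank hill construction",
                "xyz pool const incorporated", "frank studebaer co",
                "hank hill const")

# Hypothetical helper: lower-case, then drop everything except letters and digits
normalize_name <- function(x) {
  gsub("[^a-z0-9]", "", tolower(x))
}

emp_norm <- normalize_name(employer)
con_norm <- normalize_name(contractor)

# Approach 1: one agrepl() call per contractor, so the pattern is always a
# single string (this avoids the "only the first element" warning).
# max.distance = 0.2 is an assumed value, not a recommendation.
match_agrepl <- vapply(con_norm, function(p) {
  any(agrepl(p, emp_norm, max.distance = 0.2))
}, logical(1), USE.NAMES = FALSE)

# Approach 2: full Levenshtein distance matrix (rows = contractors,
# columns = employers), then keep the closest employer for each contractor.
d <- adist(con_norm, emp_norm)
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(best), best)]

# Scale by the longer of the two strings so the cutoff is comparable
# across short and long names; 0.3 is an assumed cutoff.
rel_dist <- best_dist / pmax(nchar(con_norm), nchar(emp_norm[best]))

result <- data.frame(contractor    = contractor,
                     best_employer = employer[best],
                     rel_dist      = round(rel_dist, 2),
                     match         = rel_dist <= 0.3)
print(result)
```

Printing the closest candidate next to its relative distance makes it easier to see where a cutoff separates good matches from bad ones, instead of only getting TRUE/FALSE back. Abbreviations such as "const" for "construction" will still produce large distances, so expanding common abbreviations before comparing may help.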

0 Answers