I am trying to identify observations that match between two datasets, using text string vectors $contractor
and $employer
, and create a TRUE/FALSE indicator on whether the contractor is in the employer list.
library(caTools)
list<-data.frame(ID=c(1:6),
employer=c("a.c. construction","abc concrete company","xyz pool construction inc","frank studebager llc","annoying contractors llc","beaumont ditch digging co inc"))
jobs<-data.frame(contractor=c("a-c construction","hank hill construction","xyz pool const incorporated","frank studebaer co","hank hill const"),
value=c(400000,284590,410280,310980))
jobs$match<-pmatch(jobs$contractor,list$employer,duplicates.ok=TRUE)
The pmatch command says there are 0 matches, but this is because the company names are sloppily entered and not spelled consistently; there are obviously matches. I have also used the fuzzy matching command agrepl, but in my actual data the number and quality of matching varies incredibly with small changes to the accepted Levenshtein distance.
There are also some answers here and here but my lack of advanced programming experience has kept me from applying the concepts there. Any thoughts are appreciated!