I'm very new to R and I'm trying to match two data frames based on a string column that contains a product name. The brand name sits in a separate column from the product name. Typically, the words that separate different products sit towards the end of the string, while product variations (e.g. colours) sit in the middle or at the end.
Unfortunately, I am getting a lot of false positives, i.e. products matched to each other that are not the same.
For example, using the Levenshtein distance, these two products are matched as a false positive:
Brand name = [ADIDAS ORIGINALS], Product name = [bananas print tank top]
Brand name = [ADIDAS ORIGINALS], Product name = [bananas print shorts]
The approach I'm using at the moment makes no distinction in scoring between variations of one product and different products from the same line (as seen above), so it either misses a lot of products or produces false positives. Another pair:
Brand name = [ADIDAS ORIGINALS], Product name = [Superstar 80s Black Metal Toe Cap Trainers]
Brand name = [ADIDAS ORIGINALS], Product name = [Superstar Super Colour Sun Glow Trainers]
I'm wondering whether there is a method that lets me score strings on matching sub-strings (e.g. 4 out of 5 words match) instead of the traditional string-matching techniques, or that assigns different weights to variations at the end of a string, to solve my problem.
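To make concrete what I mean by word-level scoring, here is a rough sketch. `word_overlap` is a hypothetical helper I made up for illustration (it is not from any package), and dividing by the longer word count is just one assumed choice of normalisation:

```r
# Hypothetical helper: share of words two product names have in common.
# Assumes words are separated by whitespace and that case does not matter.
word_overlap <- function(a, b) {
  wa <- unique(strsplit(trimws(tolower(a)), "\\s+")[[1]])
  wb <- unique(strsplit(trimws(tolower(b)), "\\s+")[[1]])
  # Shared words divided by the word count of the longer name
  length(intersect(wa, wb)) / max(length(wa), length(wb))
}

word_overlap("bananas print tank top", "bananas print shorts")    # 0.5
word_overlap("Bananas Print Tank Top", "bananas print tank top")  # 1
```

With a score like this, the tank top and the shorts share only 2 of 4 words, so a threshold of, say, 0.8 would keep them apart, whereas the edit distance on the raw strings does not.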
Store1
Brand store1      Prod.name store1
Adidas Originals  Bananas Print Tank Top
Adidas Originals  Bananas Print Shorts
Oasis             Geo Lace Drape Cardigan
Michael Kors      Hamilton Saffiano Leather Tote
Phase Eight       Analise Print Dress
Indulgence        Red maxi dress
Store2
Brand store2      Prod.name store2
Adidas Originals  Bananas Print Tank Top
Adidas Originals  Superstar Super Colour Sun Glow Trainers
Oasis             Geo Lace Drape Cardigan
Michael Kors      Hamilton Saffiano Leather Tote
Phase Eight       Analise Print Dress
Indulgence        Red maxi dress
How I would like to match them
Brand store1      Prod.name store1                Prod.name store2
Adidas Originals  Bananas Print Tank Top          Bananas Print Tank Top
Adidas Originals  Bananas Print Shorts            NULL
Oasis             Geo Lace Drape Cardigan         Geo Lace Drape Cardigan
Michael Kors      Hamilton Saffiano Leather Tote  Hamilton Saffiano Leather Tote
Phase Eight       Analise Print Dress             Analise Print Dress
Indulgence        Red maxi dress                  Red maxi dress
Below is the code I'm using (written with help from r-bloggers). EDIT: sample files
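# Read the two product lists
source1.devices <- read.csv('store1.csv')
source2.devices <- read.csv('store2.csv')
# Make sure the product names are character vectors, not factors
source1.devices$name <- as.character(source1.devices$prod.name)
source2.devices$name <- as.character(source2.devices$prod.name)
# Generalized Levenshtein distance between every store1/store2 name pair
dist.name <- adist(source1.devices$name, source2.devices$name, partial = TRUE, ignore.case = TRUE)
# Smallest distance in each row, i.e. the best store2 candidate per store1 product
min.name <- apply(dist.name, 1, min)
match.s1.s2 <- NULL
for (i in seq_len(nrow(dist.name)))
{
  # Column index of the first store2 name at the minimum distance
  s2.i <- match(min.name[i], dist.name[i, ])
  s1.i <- i
  match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i, s2name = source2.devices[s2.i, ]$name, s1name = source1.devices[s1.i, ]$name, adist = min.name[i]), match.s1.s2)
}
View(match.s1.s2)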