1

I'm very new to R and I'm trying to match two DataFrames in R based on a string that contains a product name. The brand name sits in a different array to the product name. Typically variations sit towards the end of the string for different product and end/middle for product variations (i.e. colours).

Unfortunately I am receiving a lot of false positives or products that don't match.

  • Using the levenshtein distance these two products are matched as a false positive

    Brand name = [ADIDAS ORIGINALS], Product name = [bananas print tank top]

    Brand name = [ADIDAS ORIGINALS], Product name = [bananas print shorts]

  • The approach I'm using at the moment makes no distinction in scoring between products that are variations and different products of the same line (as seen above) which either misses a lot of products or results in false postives

    Brand name = [ADIDAS ORIGINALS], Product name = [Superstar 80s Black Metal Toe Cap Trainers]

    Brand name = [ADIDAS ORIGINALS], Product name = [Superstar Super Colour Sun Glow Trainers]

I'm wondering whether there is a method that allows me to score strings based on matching sub-strings (i.e. 4/5 words match) instead of the traditional string matching techniques or to assign different weights to variations at the end of a string to solve my problem.

Store1

Brand store1      Prod.name store1
Adidas Originals  Bananas Print Tank Top
Adidas Originals  Bananas Print Shorts
Oasis             Geo Lace Drape Cardigan
Michael Kors      Hamilton Saffiano Leather Tote
Phase Eight       Analise Print Dress
Indulgence        Red maxi dress

Store2

Brand store2      Prod.name store2
Adidas Originals  Bananas Print Tank Top
Adidas Originals  Superstar Super Colour Sun Glow Trainers
Oasis             Geo Lace Drape Cardigan
Michael Kors      Hamilton Saffiano Leather Tote
Phase Eight       Analise Print Dress
Indulgence        Red maxi dress

How I would like to match them

Brand store1      Prod.name store1               Prod.name store2
Adidas Originals  Bananas Print Tank Top         Bananas Print Tank Top
Adidas Originals  Bananas Print Shorts           NULL
Oasis             Geo Lace Drape Cardigan        Geo Lace Drape Cardigan
Michael Kors      Hamilton Saffiano Leather Tote Hamilton Saffiano Leather Tote
Phase Eight       Analise Print Dress            Analise Print Dress
Indulgence        Red maxi dress                 Red maxi dress

Below is the code I'm using which (with help from r-bloggers)-EDIT: sample files

source1.devices<-read.csv('store1.csv')
source2.devices<-read.csv('store2.csv')

source1.devices$name<-as.character(store1.csv$prod.name)
source2.devices$name<-as.character(store2.csv$prod.name)

dist.name<-adist(store1.csv$prod.name,store2.csv$prod.name, partial = TRUE, ignore.case = TRUE)

min.name<-apply(dist.name, 1, min)

match.s1.s2<-NULL  
for(i in 1:nrow(dist.name))
{
    s2.i<-match(min.name[i],dist.name[i,])
    s1.i<-i
    match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=store2.csv[s2.i,]$prod.name, s1name=store1.csv[s1.i,]$prod.name, adist=min.name[i]),match.s1.s2)
}

View(match.s1.s2)
Davis
  • 466
  • 4
  • 20
  • Welcome to SO! I think it would help if you add few correct matches/expected output. One way to post the data is copying the output of `dput(head(source1.devices,10))` and `dput(head(source2.devices,10))` – Silence Dogood May 15 '16 at 19:59
  • Thanks for the welcome :) Do you mean, adding correct matches will "train" my scoring algorithm and improve the accuracy? – Davis May 15 '16 at 21:07
  • Hm no, I believe he meant it would help to add more example input data from `store1.csv` and `store2.csv` to your post plus the ideal output of the solution. Most readers here quickly scan for an input data frame and an request output data frame instead of reading text and lengthy code examples . :) – lukeA May 15 '16 at 21:29
  • What is intended is that we do not know what you want. You should provide a mini data set with the output that you want to get. See [how to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for some guidance. – lmo May 15 '16 at 21:30
  • Thanks for letting me know (again I'm new to SO, so apologies). I've added the sample files in the original question and made a note of the edit. Thanks! – Davis May 15 '16 at 22:24
  • Close, but you're missing the point. Most readers quickly scan for input data, not a few click-throughs to download files and import. The general etiquette (for R stuff, at least) is to provide a *minimal and reproducible example* (see @lmo's comment link) completely within the text of the question so that we can quickly process things. (Also, being paranoid, I'd rather not click on links that provide no clear indication of what is on the other side.) – r2evans May 16 '16 at 01:14
  • I think an answer would be beyond the scope of SO, because the data is very messy. To me, it's even unclear what your desired result from the example (e.g. also as a csv) should look like. I 1st thought you could take the Brand column as a blocking field, and turn Prod.name into a (cleansed) term-document-matrix. But even there you got inconsitencies. I guess it's best to try to cleanse the columns by tokenizing and removing all stopwords and stremming. Then try some Jaccard- or simple matching or even machine learning w/ predined training links using the RecordLinkage package. – lukeA May 16 '16 at 10:54
  • Maybe using a sorted jaro-winkler dist on a cleansed Prod.name column works, too. How big are your CSVs? – lukeA May 16 '16 at 10:56
  • Thanks for being patient and explaining the process, hopefully I got the minimal data set right this time. I'm only working with a small data set for testing at the moment but true set is very big (+1M rows). I'm wondering whether it's possible to score based on the number words/sub-strings that would have to be substituted instead of characters which is causing a lot of poor scoring at the moment because some prod.names are longer than others i.e. "Red" vs "Purple" should only be 1 distance apart. – Davis May 16 '16 at 17:59

0 Answers0