8

I have two character vectors in R.

I want to compare the raw character vector against the reference vector using jarowinkler and assign a % similarity score. For example, if I have 10 reference items and 20 raw data items, I want to get the best score for each comparison and what the algorithm matched it to (so 2 vectors of 10). If I have 8 raw data items and 10 reference items, I should end up with only 2 result vectors of 8 items each, holding the best match and score per item.

item, match, matched_to
ice, 78, ice-cream

Below is my code which isn't much to look at.

NumItems.Raw = length(words)
NumItems.Ref = length(Ref.Desc)

for (item in words) 
{
  for (refitem in Ref.Desc)
  {
    jarowinkler(refitem,item)

    # Find Best match Score
    # Find Best Item in reference table
    # Add both items to vectors
    # decrement NumItems.Raw
    # Loop
  }
} 
John Smith
  • Perhaps the RecordLinkage package, and a function that builds from this? compareJW <- function(string, vec, cutoff) { require(RecordLinkage); jarowinkler(string, vec) > cutoff } – lawyeR Mar 17 '15 at 15:36
  • What are your criteria for matching if there are multiple best fits with the same jarowinkler score? Do you pick the first match, or use a random selection of the best matches? –  Mar 17 '15 at 15:39

2 Answers

14

Using a toy example:

library(RecordLinkage)
library(dplyr)

ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala','bear','fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow','cat','horse')

# Build every (raw word, reference) pair
wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)

# Score each pair, then keep the best-scoring reference for every raw word
wordlist %>%
  group_by(words) %>%
  mutate(match_score = jarowinkler(words, ref)) %>%
  summarise(match = match_score[which.max(match_score)],
            matched_to = ref[which.max(match_score)])

gives

 words     match matched_to
1   cat 1.0000000        cat
2   cow 1.0000000        cow
3   dog 1.0000000        dog
4   emu 0.5277778       bear
5 horse 1.0000000      horse
6  kiwi 0.5350000      koala
7   pig 1.0000000        pig
8 sheep 1.0000000      sheep

Edit: In response to the OP's comment: the last command uses dplyr's pipeline approach. It groups every combination of raw word and reference by the raw word, adds a column match_score holding the jarowinkler score, and returns a summary with only the highest match score per word (selected by which.max(match_score)), together with the reference at that same index.
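
The "every combination, then best per group" idea described above can also be traced step by step in base R. This is only an illustrative sketch, not the answer's code: to stay free of extra packages it swaps in a stand-in similarity (1 minus normalised Levenshtein distance from adist()) instead of Jaro-Winkler, and the names pairs and best as well as the shortened toy vectors are made up for the example.

```r
# Shortened toy inputs (hypothetical, not the answer's full vectors).
ref   <- c("cat", "dog", "cow")
words <- c("dog", "cap")

# Every (raw word, reference) pair: length(words) * length(ref) rows.
pairs <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)

# ASSUMPTION: stand-in similarity, NOT Jaro-Winkler.
# 1 - Levenshtein distance divided by the longer string's length.
pairs$score <- mapply(function(w, r) {
  1 - drop(adist(w, r)) / max(nchar(w), nchar(r))
}, pairs$words, pairs$ref)

# For each raw word keep the row with the highest score, mirroring
# summarise(..., matched_to = ref[which.max(match_score)]).
best <- do.call(rbind, lapply(split(pairs, pairs$words), function(g) {
  g[which.max(g$score), ]
}))
best
```

Note that which.max() returns the first maximum, so when several references tie for the best score, the one that appears first in ref wins.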

  • Hi Jim M. Thank you for your answer, I never knew about the expand.grid function. Can you explain to me how the last command works? – John Smith Mar 17 '15 at 15:51
  • Thanks @Jim M, I got it working with your code with no problems at all, but then ran into the problem outlined [here](http://stackoverflow.com/questions/29119052/r-function-exception-in-evalexpr-envir-enclos-unknown-column/29120048#29120048), which basically comes down to NSE (non-standard evaluation) being used in the grouping and the jarowinkler function. I will keep playing around with it. Your answer is excellent and opened up new areas of R for me. Thank you very much – John Smith Mar 18 '15 at 15:22
  • @Jim M Can I achieve this for huge data frames as well? – KRU May 12 '15 at 06:21
  • @KRU: It would depend on what you mean by huge. I believe the limit would be a data.frame of 2^31 - 1 rows that could be accessed at once, otherwise the data may have to be subdivided into chunks for analysis. –  May 13 '15 at 16:15
3

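The stringdist package already implements the Jaro-Winkler distance (method='jw'). Note that with the default p = 0 it returns the plain Jaro distance; pass p = 0.1 to apply the Winkler prefix adjustment. Subtract the distance from 1 to get a similarity.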

> install.packages("stringdist")
> library(stringdist)
> 1-stringdist('ice','ice-cream',method='jw')
[1] 0.7777778
Ken Yeoh
  • Hi @Ken Yeoh, thank you for your reply. In essence this gives me the same thing as jarowinkler(refitem,item), but my problem is that when I have many items to reference against, I only want the top match and its percentage score returned. So if I have 1 item to check and 8 items to check it against, I want to get back just one answer, i.e. the highest score and the reference item it belongs to – John Smith Mar 17 '15 at 15:17
  • Ah, sorry I misunderstood your question. You can do this with stringdistmatrix, but I think @Jim M's answer is easier to implement. – Ken Yeoh Mar 17 '15 at 16:16
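
The stringdistmatrix route mentioned in the comment above could look roughly like this. It is a minimal sketch rather than the answerer's code: it reuses the toy ref and words vectors from the other answer, assumes p = 0.1 for the Winkler prefix adjustment, and resolves ties by taking the first best reference. The names sim, best and result are made up for illustration.

```r
library(stringdist)

ref   <- c("cat", "dog", "turtle", "cow", "horse", "pig", "sheep", "koala", "bear", "fish")
words <- c("dog", "kiwi", "emu", "pig", "sheep", "cow", "cat", "horse")

# Similarity matrix: one row per raw word, one column per reference item.
# ASSUMPTION: p = 0.1 (Winkler prefix adjustment); the default p = 0 is plain Jaro.
sim <- 1 - stringdistmatrix(words, ref, method = "jw", p = 0.1)

# Column index of the best-scoring reference for each raw word.
# ties.method = "first" keeps the earliest reference when scores tie.
best <- max.col(sim, ties.method = "first")

result <- data.frame(
  item       = words,
  match      = round(100 * sim[cbind(seq_along(words), best)]),  # % similarity
  matched_to = ref[best],
  stringsAsFactors = FALSE
)
result
```

The result has one row per raw word (8 here), regardless of how many reference items there are, which is the shape the question asks for.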