I need to calculate the Jaccard similarity between every pair of words from two vectors, word by word, and then extract the most similar word.

Here is my current code, which is very slow:

library(stringdist)

txt1 <- c('The quick brown fox jumps over the lazy dog')
txt2 <- c('Te quick foks jump ovar lazey dogg')

words <- strsplit(as.character(txt1), " ")
words.p <- strsplit(as.character(txt2), " ")

r <- length(words[[1]])
c <- length(words.p[[1]])

m <- matrix(nrow=r, ncol=c)
for (i in 1:r){
  for (j in 1:c){
    m[i,j] = stringdist(tolower(words.p[[1]][j]), tolower(words[[1]][i]), method='jaccard', q=2)
  }
}

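# row of the smallest distance; arr.ind=TRUE avoids converting a linear index by hand
ind <- which(m == min(m), arr.ind=TRUE)[, "row"]
words[[1]][ind]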

Please help me improve and clean up this code so that it works on a large data frame.

Dennix
  • How large is "large", and how long does it take using your code? – lukeA Nov 25 '16 at 12:25
  • Try this `sapply(words.p, function(x) mapply(stringdist, words, x, method='jaccard'))`. This will directly give you a matrix which you can easily examine. – Chirayu Chamoli Nov 25 '16 at 14:05

1 Answer

Preparation (added tolower here):

library(stringdist)

txt1 <- c('The quick brown fox jumps over the lazy dog')
txt2 <- c('Te quick foks jump ovar lazey dogg')

words <- unlist(strsplit(tolower(as.character(txt1)), " "))
words.p <- unlist(strsplit(tolower(as.character(txt2)), " "))

Get distances for each word:

dists <- sapply(words, Map, f=stringdist, list(words.p), method="jaccard")
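For reference, the Jaccard distance between two words is one minus the size of the intersection of their q-gram sets divided by the size of the union. The base-R sketch below only illustrates that definition for bigrams; `qgram_set` and `jaccard_dist` are hypothetical helper names, not functions from stringdist, and strings shorter than `q` are not handled specially here.

```r
# Hypothetical helpers for illustration only; not part of the stringdist package.

# Set of distinct q-grams (substrings of length q) in a single string
qgram_set <- function(s, q = 2) {
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

# Jaccard distance: 1 - |A intersect B| / |A union B|
jaccard_dist <- function(a, b, q = 2) {
  A <- qgram_set(a, q)
  B <- qgram_set(b, q)
  1 - length(intersect(A, B)) / length(union(A, B))
}

jaccard_dist("jumps", "jump")   # 3 shared bigrams out of 4 distinct ones: 0.25
jaccard_dist("quick", "quick")  # identical bigram sets: 0
```

A distance of 0 means the two words have exactly the same q-grams, and 1 means they share none.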

For each word in words find the closest word from words.p:

matches <- words.p[sapply(dists, which.min)]

cbind(words, matches)
      words   matches
 [1,] "the"   "te"
 [2,] "quick" "quick"
 [3,] "brown" "ovar"
 [4,] "fox"   "foks"
 [5,] "jumps" "jump"
 [6,] "over"  "ovar"
 [7,] "the"   "te"
 [8,] "lazy"  "lazey"
 [9,] "dog"   "dogg"
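As a possible alternative for larger inputs, stringdist also provides `stringdistmatrix`, which computes all pairwise distances in one vectorised call. The sketch below is one way to use it, not the method used above. Note two assumptions: it passes `q = 2` as in the question, whereas the `sapply` call above leaves `q` at the package default of 1, so the matches can differ from the table above; and the `closest` data frame is just an illustrative name.

```r
library(stringdist)

txt1 <- 'The quick brown fox jumps over the lazy dog'
txt2 <- 'Te quick foks jump ovar lazey dogg'
words   <- unlist(strsplit(tolower(txt1), " "))
words.p <- unlist(strsplit(tolower(txt2), " "))

# One call for every pair: rows follow words, columns follow words.p
d <- stringdistmatrix(words, words.p, method = "jaccard", q = 2)
dimnames(d) <- list(words, words.p)

# Closest candidate for each row; which.min keeps the first column on ties
idx <- apply(d, 1, which.min)
closest <- data.frame(
  word     = words,
  match    = words.p[idx],
  distance = unname(apply(d, 1, min))
)
closest
```

With bigrams, a word such as "brown" shares no q-gram with any candidate, so all its distances are 1 and `which.min` simply returns the first candidate. Filtering on `distance` (for example, keeping only rows below some threshold) would avoid reporting such non-matches.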

EDIT:

To get the best matching word pair you first need to select the minimum distance from each word in words to all words in words.p:

mindists <- sapply(dists, min)

This gives the best possible distance for each word. Then select the word from words with the smallest of those distances:

words[which.min(mindists)]

Or in one line:

words[which.min(sapply(dists, min))]
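If the distances are held in a matrix instead of a list, the single best pair can be read off with `which(..., arr.ind = TRUE)`, which returns row and column positions directly. This is a sketch under the assumption that `stringdistmatrix` with `q = 2` is acceptable for your data; the variable names `d` and `best` are illustrative.

```r
library(stringdist)

words   <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
words.p <- c("te", "quick", "foks", "jump", "ovar", "lazey", "dogg")

# Rows follow words, columns follow words.p
d <- stringdistmatrix(words, words.p, method = "jaccard", q = 2)

# arr.ind = TRUE gives (row, col) positions, one row per tied minimum
best <- which(d == min(d), arr.ind = TRUE)

data.frame(
  word     = words[best[, "row"]],
  match    = words.p[best[, "col"]],
  distance = d[best]
)
```

Unlike `which.min`, this keeps every pair that ties for the smallest distance, so it is worth checking `nrow(best)` before assuming a unique answer.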
Karolis Koncevičius
  • Thanks! But I want only the single best word, which in this case is "quick". How do I extract it? – Dennix Nov 25 '16 at 13:53
  • @Dennix added a line about how to do that in the answer (after EDIT) – Karolis Koncevičius Nov 25 '16 at 18:00
  • @KarolisKoncevičius, thank you for your solution. I was looking for something similar, but for matching a list of addresses. I have a dataset that contains around 70K different addresses and another large dataset that contains around 4 lakh records (0.4 million). I want to match each address against the large dataset by looking at the words of the address. How can I achieve this? I have posted a question at http://stackoverflow.com/questions/42486172/r-string-match-for-address-using-stringdist-stringdistmatrix Please help! – user1412 Mar 04 '17 at 15:00