Here is an example of my data:

Table 1: User table
id     address
1      mont carlo road,CA
2      mont road,IS
3      mont carlo road1-11,CA

Table 2 (the output I want to get): similarity matrix
id   1    2    3
1
2    3
3    1    3

Scale: 1 = very similar, 3 = very dissimilar.

My problem is how to measure, in R, the similarity between the cases in Table 1 based on their addresses, and then output the result as a similarity matrix like Table 2. Specifically, I need to know how to compare two strings in R, how to set a scale that measures the similarity of each pair, and finally how to output the result as a matrix.

Dennis Meng
user3566160
  • http://stackoverflow.com/questions/6704499/algorithm-to-compare-similarity-of-english-sentences – KFB Oct 17 '14 at 05:31
  • @KFB Thanks for your suggestion. I am looking for a detailed method/algorithm in R. – user3566160 Oct 17 '14 at 05:35
  • See my answer with RecordLinkage to this question: http://stackoverflow.com/questions/26405895/how-can-i-match-fuzzy-match-strings-from-two-datasets#26408600 – lawyeR Oct 17 '14 at 10:24

2 Answers

You might be interested in the Levenshtein distance, which is implemented in the R package stringdist. For example:

library(stringdist)

# The three addresses from the question
address <- c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA")

# Levenshtein distance between the first two addresses
stringdist(address[1], address[2], method="lv")
## [1] 8

You could then arrange the pairwise distances into a matrix or whatever output you need.
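
As a rough illustration of that last step, here is a minimal sketch that builds the full pairwise matrix and bins it into a 1 to 3 scale. It assumes base R's adist() (from the utils package) as a stand-in for stringdist so that it runs without extra packages; the names ids and sim_class are made up for this example, and equal-width binning is only one possible choice of scale.

```r
# Same three addresses as in the question
address <- c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA")
ids <- 1:3  # hypothetical id vector matching Table 1

# Pairwise Levenshtein distances; adist() ships with base R (package utils)
d <- adist(address)
dimnames(d) <- list(ids, ids)

# Bin the distinct pairs (lower triangle) into 3 equal-width classes:
# 1 = most similar, 3 = least similar. This scale is an assumption.
lower <- lower.tri(d)
sim_class <- matrix(NA_integer_, nrow(d), ncol(d), dimnames = dimnames(d))
sim_class[lower] <- as.integer(cut(d[lower], breaks = 3, labels = FALSE))

d          # raw distance matrix
sim_class  # lower-triangular matrix of similarity classes
```

For these three addresses the Levenshtein distances are 8 (1 vs 2), 4 (1 vs 3) and 12 (2 vs 3), so the classes come out as 2, 1 and 3. Because the bins are derived from the observed range, the classes are relative to the data at hand rather than an absolute measure of similarity.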

Stedy

I'd also use the stringdist package but would make use of outer and cut to finish the job:

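library(stringdist)

# Example data; keep the addresses as character strings rather than factors
dat <- data.frame(
    address = c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA"),
    id = 1:3,
    stringsAsFactors = FALSE
)

# Pairwise distances between all addresses, using method "jw"
m <- outer(dat[["address"]], dat[["address"]], stringdist, method="jw")

# Bin the off-diagonal distances into 3 equal-width classes (1 = most similar)
m[lower.tri(m)] <- cut(m[lower.tri(m)], 3, labels=1:3)
m[upper.tri(m)] <- cut(m[upper.tri(m)], 3, labels=1:3)
dimnames(m) <- list(dat[["id"]], dat[["id"]])
diag(m) <- NA  # self-comparisons are not of interest
m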

##    1  2  3
## 1 NA  3  1
## 2  3 NA  3
## 3  1  3 NA

You can use whichever method you prefer for calculating the distance; see ?stringdist for the available options.

Tyler Rinker