2

I am trying to merge two data sets with a string place which is similar but in many cases not exactly the same.

For example there can be records like London, UK and just London in the other dataset. I would like to have a matching score next to each record and merge them on the similarity score (and merge if it is greater than .80).

Is there a way to merge two datasets A and B with variable place?

The first dataset A has four variables:name, place, origin, ranking

The second dataset B has three variables: ratings, place, destinations, trip time

I want to merge dataset A and dataset B.

The final merged dataset should have the following variables: name, place, origin, ranking, ratings, destinations, trip time

I want to merge based on dataset A (like a left join in SQL). I want all records of A with added columns from B.

zx8754
  • 52,746
  • 12
  • 114
  • 209
user3570187
  • 1,743
  • 3
  • 17
  • 34
  • a similar question was asked minutes ago. If it doesn't help, expand on your question with examples. http://stackoverflow.com/questions/30129903/how-can-i-fuzzy-match-corporations-names-when-using-two-separate-apis-datab – Pierre L May 08 '15 at 19:15
  • I have expanded the question to be more specific. – user3570187 May 08 '15 at 19:27
  • [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – zx8754 May 08 '15 at 19:47

1 Answers1

4

stringdist library has many functions, here is one example using soundex similarity:

library(stringdist)

#dummy data
df1 <- data.frame(ID1=c("london","USA","UK","ball","xyz"))
df2 <- data.frame(ID2=c("lndon","US","UKS","bull","ttt"),
                  x=1:5)

#get soundex
df1$soundex <- phonetic(df1$ID1, method = c("soundex"), 
                        useBytes = FALSE)
df2$soundex <- phonetic(df2$ID2, method = c("soundex"), 
                        useBytes = FALSE)

#merge on soundex column, keep all rows from df1
merge(df1, df2, by="soundex", all.x=TRUE)

#output
#   soundex    ID1   ID2  x
# 1    B400   ball  bull  4
# 2    L535 london lndon  1
# 3    U200    USA    US  2
# 4    U200    USA   UKS  3
# 5    U200     UK    US  2
# 6    U200     UK   UKS  3
# 7    X200    xyz  <NA> NA
zx8754
  • 52,746
  • 12
  • 114
  • 209