I have two data frames, each with a column containing text. I want to merge these data frames using (imperfect) matches between the text columns. For example, if cell 1 of the text column of data frame 1 contains a word (or part of a word) that resembles a word (or part of a word) in cell 2 of the text column of data frame 2, then I want the data frames to be merged on these cells. What is the best way to do this in R?

I am not sure if my question is clear enough, but if so, does anyone know of an R package or a function that can help me do this kind of merging?

Many thanks in advance!

rdatasculptor

1 Answer

Try the RecordLinkage package.

Here is a possible solution that merges rows based on how closely the two strings match, measured by their Levenshtein (edit) distance:

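library(reshape2)      # melt()
library(RecordLinkage) # levenshteinDist()
# Note: sample() changed in R 3.6.0, so newer versions give a different ex2 than shown below
set.seed(16)
l <- LETTERS[1:10]
# ex1: clean three-letter keys; ex2: randomly scrambled keys to match against
ex1 <- data.frame(lets = paste(l, l, l, sep = ""), nums = 1:10)
ex2 <- data.frame(lets = paste(sample(l), sample(l), sample(l), sep = ""), 
                  nums = 11:20)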
ex1
#    lets nums
# 1   AAA    1
# 2   BBB    2
# 3   CCC    3
# 4   DDD    4
# 5   EEE    5
# 6   FFF    6
# 7   GGG    7
# 8   HHH    8
# 9   III    9
# 10  JJJ   10
ex2
#    lets nums
# 1   GDJ   11
# 2   CFH   12
# 3   DBE   13
# 4   BED   14
# 5   FJB   15
# 6   JHG   16
# 7   AII   17
# 8   ICC   18
# 9   EGF   19
# 10  HAA   20
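# Distance between every pair of strings, reshaped to long format (Var1, Var2, value)
lets <- melt(outer(ex1$lets, ex2$lets, FUN = "levenshteinDist"))
lets <- lets[lets$value < 2, ] # adjust the "< 2" as necessary
# Var1 and Var2 are row indices into ex1 and ex2 respectively
cbind(ex1[lets$Var1, ], ex2[lets$Var2, ])
#   lets nums lets nums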
# 9  III    9  AII   17
# 3  CCC    3  ICC   18
# 1  AAA    1  HAA   20
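
The same idea can probably be sketched with base R only, using adist() from the utils package, which computes a matrix of generalized Levenshtein distances. The snippet below is a minimal sketch, not a tested replacement for the code above: the three-row data frames, the max_dist cutoff, and the matched name are made up for illustration, and the data is hard-coded rather than sampled so that the result does not depend on the random number generator.

```r
# Small hand-made data (hypothetical; not the sampled ex2 from the post)
ex1 <- data.frame(lets = c("AAA", "BBB", "CCC"), nums = 1:3,
                  stringsAsFactors = FALSE)
ex2 <- data.frame(lets = c("HAA", "ICC", "EGF"), nums = 11:13,
                  stringsAsFactors = FALSE)

# adist() returns a distance matrix: rows index ex1$lets, columns index ex2$lets
d <- adist(ex1$lets, ex2$lets)

# Keep every pair whose distance is below the cutoff (assumed value, adjust as needed)
max_dist <- 2
hits <- which(d < max_dist, arr.ind = TRUE)

# Bind the matching rows side by side; data.frame() renames the
# duplicated columns to lets.1 and nums.1
matched <- data.frame(ex1[hits[, "row"], ], ex2[hits[, "col"], ],
                      dist = d[hits], row.names = NULL)
matched
```

Keeping the distance in its own column makes it easier to inspect borderline matches before settling on a cutoff.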
Jack Ryan
  • Thanks! This seems to work. Maybe you can add library(reshape2) to the code. – rdatasculptor Aug 09 '13 at 21:26
  • Thanks, I will. I came across this post as well http://www.r-bloggers.com/approximate-string-matching-in-r/ Have you seen it? Interesting package in this domain I think. – rdatasculptor Aug 10 '13 at 18:02