2

I try to merge two data.tables, but due to different spelling in stock names I lose a substantial number of data points. Hence, instead of an exact match I was looking into a fuzzy merge.

library("data.table")
dt1 = data.table(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2))
dt2 = data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

When merging dt1 and dt2 on "Name", ASML HOLDING will be excluded due to the addition of "NV", while the actual data would be accurate.

The prefered final data output would look somthing like:

              Name A B
1:  ABN AMRO GROUP 2 q
2: ASML HOLDING NV 1 p

What I tried next was the following:

dt1 = dt1[, dt2_NAME := agrep(dt1$Name, dt2$Name, ignore.case = TRUE, value = TRUE, max.distance = 0.05, useBytes = TRUE)]

However, I get the following error,

argument 'pattern' has length > 1 and only the first element will be used

The error makes sense as dt1$Name is longer than 1, but I believe it would be a possible solution if it would consider dt1$Name on a row to row basis.

It might be a stupid mistake, but for some reason I just can't get my head around it. Furthermore, I prefer to use data.table as my dataset is fairly large and up till now it has worked splendidly. Additionally, I am new to stack overflow, so sorry if my question is somewhat off.

Lastly, I found a piece of code which does the job, but is too slow for practical usage. Fuzzy merge in R

dt1$Name_dt2 <- "" # Creating an empty column
for(i in 1:dim(dt1)[1]) {
  x <- agrep(dt1$Name[i], dt2$Name,
             ignore.case=TRUE, value=TRUE,
             max.distance = 0.05, useBytes = TRUE)
  x <- paste0(x,"")
  dt1$Name_dt2[i] <- x
}
Hjalmar
  • 122
  • 10

1 Answers1

2

A possible solution using 'fuzzyjoin':

library(fuzzyjoin)
f <- Vectorize(function(x,y) agrepl(x, y,
                                   ignore.case=TRUE,
                                   max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
#          Name.x A          Name.y B
#1   ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2  ABN AMRO GROUP q

NOTE : The main problem, that you encountered too, was that agrep and agrepl don't seem to expect the first argument to be a vector. That's the reason why I wrapped the call with Vectorize.

This method can be used together with an equi-join (mind the order of columns in the by!):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)
Nicolas2
  • 2,170
  • 1
  • 6
  • 15
  • Thank you.I had not the change to check it for myself, but would this method also work if I had an additonal key ("Date") to merge on which has to be an exact match. Basically can I use the combination of an exact match (Date) for 1 key and a fuzzy match (Name) for another? I will also add this to my original question. – Hjalmar Sep 19 '18 at 10:36
  • Yes, I edited my answer. With the function I used, you have to do that in two steps, but the package also includes the possibility of using more than one column from each data set. It is just a bit more complicated to find the corct syntax.. – Nicolas2 Sep 19 '18 at 11:00