
Here is my sample data:

a <- data.frame(name = c('Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana',
                         'Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana'),
                date=c(1991,1991,1991,1991,1991,1991,
                       1992,1992,1992,1992,1992,1992),
                price = c(10, 13, 2, 1, 15, 1,
                          11,15,3,2,14,4))
b <- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
                         'Ace Co.', 'Bayes INC.', 'asd'),
                date=c(1991,1991,1991,1992,1992,1992),
                qty = c(9, 99, 10,10,105,15))

I am left joining a to b by date and name, where date must match exactly while name is fuzzy. I have tried stringdist_join, but it only accommodates the fuzzy merge, not a combined exact and fuzzy join.
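
For reference, stringdist_join comes from the fuzzyjoin package. A minimal sketch of that kind of call (the exact arguments may differ): it fuzzy-matches name, but there is no option within the call to also require an exact match on date.

library(fuzzyjoin)

# fuzzy left join on name only; date is not constrained to match exactly
stringdist_left_join(a, b, by = "name", max_dist = 2)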

The expected output is as follows:

c<- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
                         'Ace Co.', 'Bayes INC.', 'asd'),
                date=c(1991,1991,1991,1992,1992,1992),
                qty = c(9, 99, 10,10,105,15),
                price = c(10, 13, 2,11,15,3))

I'd like to do this with dplyr.

derek
  • Should `"Bayes"` match `"Bayes INC."`? – Rui Barradas Jul 23 '21 at 04:10
  • How do you choose which name to keep? i.e. `asdf`, `asd` `aasdf` for merging? – Paul Jul 23 '21 at 05:45
  • You might find these posts interesting: https://www.r-bloggers.com/2015/02/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/, https://cran.r-project.org/web/packages/stringdist/stringdist.pdf and https://stackoverflow.com/a/30132177/10264278 – Paul Jul 23 '21 at 06:07
  • I have edited it a little bit. The dictionary names come from data b. If you look at the expected output c, the names are the same as in b. I am merging a to b, but the names in a are slightly different from those in b: "Ace CO" matches "Ace Co.", "Bayes" matches "Bayes INC.", "aasd" matches "asd". – derek Jul 23 '21 at 08:41

2 Answers


agrep solution

The following function is almost surely not as general as it should be, but here it goes.

funMerge <- function(X, Y, col, col_approx, sep = "."){
  other_cols.x <- setdiff(names(X), c(col, col_approx))
  other_cols.y <- setdiff(names(Y), c(col, col_approx))
  # split both data sets by the exact-match column
  sp.x <- split(X, X[[col]])
  sp.y <- split(Y, Y[[col]])
  common_names <- intersect(names(sp.x), names(sp.y))

  res <- sapply(common_names, function(sp.name){
    x <- sp.x[[sp.name]]
    y <- sp.y[[sp.name]]
    # for each name in x, find the approximate matches in y
    k <- sapply(x[[col_approx]], agrep, y[[col_approx]])
    k <- k[sapply(k, length) > 0]
    k <- unlist(k)                        # indices into y, named by the x names
    i <- match(names(k), x[[col_approx]]) # corresponding row indices in x
    df_other.x <- x[i, other_cols.x, drop = FALSE]
    df_other.y <- y[k, other_cols.y, drop = FALSE]
    df_tmp <- data.frame(
      x[i, col],
      names(k),
      y[k, col_approx]
    )
    names(df_tmp) <- c(col, col_approx, paste(col_approx, "y", sep = sep))
    cbind(df_tmp, df_other.x, df_other.y)
  }, simplify = FALSE)
  res <- do.call(rbind, res)
  row.names(res) <- NULL
  res
}

funMerge(a, b, col = "date", col_approx = "name")
#  date   name     name.y price qty
#1 1991 Ace Co    Ace Co.    10   9
#2 1991  Bayes Bayes Inc.    13  99
#3 1991    asd       asdf     2  10
#4 1992 Ace Co    Ace CO.    11  10
#5 1992  Bayes Bayes INC.    15 105
#6 1992    asd      aasdf     3  15
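
For reference, the approximate matching inside funMerge is done by base R's agrep(), which returns the indices of the elements of a character vector that approximately match a pattern. A minimal illustration with the names from b:

agrep("Bayes", c("Ace Co.", "Bayes INC.", "asd"))
# [1] 2    ("Bayes" matches "Bayes INC.")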

stringdist solution

The following function uses package stringdist to compute the Jaro-Winkler pairwise distances between the columns that need to be matched approximately.

From help('stringdist-metrics'), my emphasis.

The metric you need to choose for an application strongly depends on both the nature of the string (what does the string represent?) and the cause of dissimilarities between the strings you are measuring. For example, if you are comparing human-typed names that may contain typo's, the Jaro-Winkler distance may be of use. If you are comparing names that were written down after hearing them, a phonetic distance may be a better choice.
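
Whichever metric you pick, its scale (and therefore a sensible threshold) changes with it. A quick way to get a feel for this is to compare a known pair of names under different methods, for example:

library(stringdist)

# "jw" (Jaro-Winkler) returns a value between 0 and 1, while "lv" (Levenshtein)
# counts edit operations, so a useful threshold differs between the two
stringdist("Bayes", "Bayes INC.", method = "jw")
stringdist("Bayes", "Bayes INC.", method = "lv")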

A more efficient algorithm would be to first split the data sets by the exact-match column and then apply the method of funMerge2 within each group (a sketch of this is shown after the output below).

library(stringdist)

funMerge2 <- function(X, Y, col, col_approx, method = "jw", threshold = 0.2){
  x <- X[[col_approx]]
  y <- Y[[col_approx]]
  # pairwise string distances between the columns to be matched approximately
  d <- stringdistmatrix(x, y, method = method, useBytes = FALSE)
  # keep the (row, col) pairs whose distance is below the threshold
  w <- which(d < threshold, arr.ind = TRUE)
  Z1 <- X[w[, "row"], ]
  Z2 <- Y[w[, "col"], ]
  res <- cbind(Z1, Z2)
  # keep only the rows where the exact-match column agrees in both data sets
  common_cols <- grep(col, names(res))
  res <- res[apply(res[, common_cols], 1, function(x) x[1] == x[2]), ]
  row.names(res) <- NULL
  res
}

funMerge2(a, b, col = "date", col_approx = "name")
#    name date price       name date qty
#1 Ace Co 1991    10    Ace Co. 1991   9
#2  Bayes 1991    13 Bayes Inc. 1991  99
#3    asd 1991     2       asdf 1991  10
#4 Ace Co 1992    11    Ace CO. 1992  10
#5  Bayes 1992    15 Bayes INC. 1992 105
#6    asd 1992     3      aasdf 1992  15
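
As mentioned above, a more efficient variant splits both data sets on the exact-match column first and only computes string distances within each group. A minimal sketch of that idea, reusing funMerge2:

# split on the exact-match column, fuzzy-match names only within each year,
# then stack the per-year results
common_dates <- intersect(a$date, b$date)
do.call(rbind, lapply(common_dates, function(d)
  funMerge2(a[a$date == d, ], b[b$date == d, ], col = "date", col_approx = "name")
))
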
Rui Barradas

Using distance matrix to merge fuzzy strings

Main principle

Get the distance matrix between the unique terms of your vectors. Then, check what threshold might lead to the best results (this has to be human supervised, I think). Then, use this new correspondence table to merge your dataframes. Finally, you can change names (e.g. adding "inc.") more easily because you have "standardized" names.

With utils::adist()

I think stringdist is better because you can choose the method, but here is a base R example as a suggestion on how to use this concept of distance to get the expected output.

# First, create a matrix with the standard Levenshtein distance between the name fields of both sources (or use another method from stringdist)
dist_name_matrix <- adist(unique(a$name), unique(b$name), partial = TRUE, ignore.case = TRUE)
colnames(dist_name_matrix) <- unique(b$name)
rownames(dist_name_matrix) <- unique(a$name)

# let's convert this matrix to a data frame so it is easier to inspect; you will need to check the matches yourself
library(dplyr)
library(tidyr)

dist_df <- dist_name_matrix %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(., "a_name") %>% 
  pivot_longer(cols = 2:last_col(), names_to = "b_name", values_to = "dist") %>% 
  filter(dist < 2) # you might need to adapt this to your needs

# Now this can be used to merge your data, i.e.

a %>% 
  left_join(., dist_df, by = c("name" = "a_name")) %>% 
  right_join(., b, by = c("b_name" = "name", "date" = "date")) %>% 
  # added just to match your expected output
  filter(!is.na(name)) %>% 
  select(b_name, date, qty, price)

Output:

      b_name date qty price
1    Ace Co. 1991   9    10
2 Bayes INC. 1991  99    13
3        asd 1991  10     2
4    Ace Co. 1992  10    11
5 Bayes INC. 1992 105    15
6        asd 1992  15     3

The same process can be used with stringdist:

library(stringdist)
dist_name_matrix <- stringdistmatrix(unique(a$name), unique(b$name), method = "jw", useBytes = FALSE)
colnames(dist_name_matrix) <- unique(b$name)
rownames(dist_name_matrix) <- unique(a$name)

Then just adapt the threshold after a human check, e.g. filter(dist < 0.2).
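
For instance, reusing the reshaping step from above with the Jaro-Winkler matrix (the 0.2 cutoff is only a starting point and should be checked against your data):

dist_df <- dist_name_matrix %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(., "a_name") %>% 
  pivot_longer(cols = 2:last_col(), names_to = "b_name", values_to = "dist") %>% 
  filter(dist < 0.2)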

Paul
  • Paul, would you mind taking a look at my edited question? I've changed the names a little bit. The expected output c has the same names as b; I am just merging a to b, but a has slightly different names from b. So the final output c should keep the dictionary names from b. – derek Jul 23 '21 at 08:57
  • @derek, I have edited my answer: I changed the merging process so that b is the dictionary, and I also changed the threshold for "similarities" from `0` to `< 2`. Please keep in mind that this threshold is very important. It will change with the method you use and the degree to which you accept the proposed connections. – Paul Jul 23 '21 at 09:27