How to fuzzy join based on multiple columns and conditions?

Question

I'm trying to left join two data frames (df1, df2). The data frames have two columns in common: zone and slope. Zone is a factor column and slope is numeric.

    df1 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)))
    df2 = data.frame(slope = c(2.4, 2.4,6.2), zone = c(rep("Low", 1), rep("High", 2)), other = c(rep("a", 1), rep("b", 1), rep("c", 1)))
    df1
    df2

I want to join the data frames so that rows are first matched exactly on zone, and then on the closest slope within that zone. If two slope values are equidistant, it doesn't matter whether the join rounds up or down, as long as the rule is applied consistently and does not produce duplicate rows.

I'd prefer to do this with a fuzzy_join or dplyr rather than data.table.

The result should look something like:

    df3 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)), other = c(rep("a", 3), rep("b",1), rep("c",2)))
    df3

where the value of "other" is first determined by zone, and then the closest slope.

I've tried:

    distance_left_join(df1, df2, by=c("zone"= "zone", "slope"="slope"))

as well as other types of fuzzy joins, but I think they may not be working because the columns are of different types. I suspect there is a fuzzy_left_join solution, but I don't understand how to create a match function.

score 2 · Answer 1 · answered Oct 20 '20 at 06:46

Here is how to do a fuzzy join with multiple match_funs, one per joined column. If you want to mix in more complex match_funs, you'll have to define them as functions yourself, as I did in "Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join".

    df1 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)))
    df2 = data.frame(slope = c(2.4, 2.4,6.2), zone = c(rep("Low", 1), rep("High", 2)), other = c(rep("a", 1), rep("b", 1), rep("c", 1)))

    library(fuzzyjoin); library(dplyr)

    # First, need to define match_fun_distance.
    # This is copied from the source code for distance_join in https://github.com/dgrtwo/fuzzyjoin
    match_fun_distance <- function(v1, v2) {

      # settings for this method
      method = "manhattan"
      max_dist = 99
      distance_col = "dist"

      if (is.null(dim(v1))) {
        v1 <- t(t(v1))
        v2 <- t(t(v2))
      }
      if (method == "euclidean") {
        d <- sqrt(rowSums((v1 - v2)^2))
      }
      else if (method == "manhattan") {
        d <- rowSums(abs(v1 - v2))
      }
      ret <- tibble::tibble(instance = d <= max_dist)
      if (!is.null(distance_col)) {
        ret[[distance_col]] <- d
      }
      ret
    }

    (joined_result <- fuzzy_join(df1, df2,
                                 by=c("zone"= "zone", "slope"="slope"),
                                 match_fun = list(`==`, match_fun_distance),
                                 mode = "left"))
    #>   slope.x zone.x slope.y zone.y other slope.dist zone.dist
    #> 1       1    Low     2.4    Low     a        1.4        NA
    #> 2       2    Low     2.4    Low     a        0.4        NA
    #> 3       3    Low     2.4    Low     a        0.6        NA
    #> 4       4   High     2.4   High     b        1.6        NA
    #> 5       4   High     6.2   High     c        2.2        NA
    #> 6       5   High     2.4   High     b        2.6        NA
    #> 7       5   High     6.2   High     c        1.2        NA
    #> 8       6   High     2.4   High     b        3.6        NA
    #> 9       6   High     6.2   High     c        0.2        NA

    joined_result %>%
      group_by(slope.x, zone.x) %>%
      top_n(1, -slope.dist)
    #> # A tibble: 6 x 7
    #> # Groups:   slope.x, zone.x [6]
    #>   slope.x zone.x slope.y zone.y other slope.dist zone.dist
    #>     <int> <fct>    <dbl> <fct>  <fct>      <dbl>     <dbl>
    #> 1       1 Low        2.4 Low    a          1.4          NA
    #> 2       2 Low        2.4 Low    a          0.400        NA
    #> 3       3 Low        2.4 Low    a          0.6          NA
    #> 4       4 High       2.4 High   b          1.6          NA
    #> 5       5 High       6.2 High   c          1.2          NA
    #> 6       6 High       6.2 High   c          0.2          NA

Created on 2020-10-20 by the reprex package (v0.3.0)
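One caveat: top_n() keeps ties, so two equidistant candidates would still yield duplicate rows, which the question asks to avoid. A minimal dplyr-only sketch of the same idea (join exactly on zone, then keep the single closest slope per df1 row) might look like the following. The helper column row_id and the result name closest are illustrative names, not part of the original answer.

```r
library(dplyr)

df1 <- data.frame(slope = 1:6, zone = c(rep("Low", 3), rep("High", 3)))
df2 <- data.frame(slope = c(2.4, 2.4, 6.2),
                  zone  = c("Low", "High", "High"),
                  other = c("a", "b", "c"))

closest <- df1 %>%
  # row_id (illustrative name) identifies each df1 row, even if slope/zone repeat
  mutate(row_id = row_number()) %>%
  # exact match on zone; recent dplyr versions may warn about a
  # many-to-many relationship here, which is expected for this approach
  left_join(df2, by = "zone", suffix = c("", ".y")) %>%
  mutate(slope.dist = abs(slope - slope.y)) %>%
  # sort candidates by distance within each df1 row (NA distances sort last)
  arrange(row_id, slope.dist) %>%
  # keep only the first (closest) candidate, so ties cannot duplicate rows
  distinct(row_id, .keep_all = TRUE) %>%
  select(slope, zone, other)

closest
```

Because only the first candidate per row_id is kept, an exact tie is resolved by whichever candidate sorts first, and df1 rows whose zone has no counterpart in df2 stay in the result with NA in other.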

Hi Arthur, can I ask you how the two different methods work? I'm trying to do fuzzy matching with several numerical variables but I need to assign different weights to distances according to the unit. For instance, if a dummy takes value 1 rather than 0 it is supposed to be a much worse match than a 1-unit difference if the variable is 'year'. Thanks a lot for any help — Antonio, May 04 '22 at 15:10
If you use the code above, the different dists can be multiplied by your specified weights and then summed to get a combined distance for your specific purposes — Arthur Yip, May 04 '22 at 15:52
I see what you mean, but I guess that's an ex-post computation. Since I have many possible matches for each of my observations, I would have to modify the function so as to get the best match out of all the possible ones. Can I do that by modifying your code? — Antonio, May 04 '22 at 16:13
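The weighted-distance idea from the comments above can be sketched as follows. This is only an illustration under stated assumptions: the data frames left and right, their columns, and the weights w_year and w_dummy are all hypothetical. As far as I can tell, a match_fun only compares pairs of values, so choosing the single best candidate remains a step after the pairing; here the pairing is a plain cross join rather than fuzzy_join.

```r
library(dplyr)

# Hypothetical data: observations to match and candidate matches
left  <- data.frame(id = 1:3,
                    year  = c(2000, 2005, 2010),
                    dummy = c(0, 1, 0))
right <- data.frame(cand       = c("p", "q", "r"),
                    cand_year  = c(2001, 2004, 2010),
                    cand_dummy = c(1, 1, 0))

# Hypothetical weights: a dummy mismatch costs as much as 10 years
w_year  <- 1
w_dummy <- 10

best <- merge(left, right, by = NULL) %>%   # all left/right pairs (cross join)
  # weighted sum of per-variable absolute differences
  mutate(dist = w_year * abs(year - cand_year) +
                w_dummy * abs(dummy - cand_dummy)) %>%
  # keep the candidate with the smallest combined distance per observation
  arrange(id, dist) %>%
  distinct(id, .keep_all = TRUE)

best
```

A cross join grows with the product of the two row counts, so for larger data it may be worth restricting candidates first, for example by requiring an exact match on a grouping column.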
