How to make a fuzzy join in R using more than one variable on each side

Question

I would like to join the two data frames :

a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))

with a condition like (x>start)&(x<end) in order to get such a result:

#  x    y
#1 1    a
#2 2 <NA>
#3 3    b

I don't want to make a potentially large cartesian product and then select only the few rows matching the condition and I'd like a solution using the tidyverse (I am not interested in a solution using SQL which would be a confession of failure). I thought of the 'fuzzyjoin' package but I cannot find examples fitting my need : the function to apply for the condition has only two arguments. I also tried to put 'start' and 'end' into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y) # z y #1 0, 2 a #2 4, 6 b

but although the data looks fine fuzzy_left_join doesn't accept it.

I search for solutions working in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric with arbitrary conditions).

UPDATE

I also want to be able to express conditions like (x=start+1)|(x=end+1) giving here:

#   x  y
#1  1  a
#2  3  a
#3  5  b

score 5 · Answer 1 · answered Mar 02 '19 at 09:11

5

For this case you don't need multi_by or multy_match_fun, this works :

library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
#   x start end    y
# 1 1     0   2    a
# 2 3    NA  NA <NA>
# 3 5     4   6    b

answered Mar 02 '19 at 09:11

moodymudskipper

46,417
11
121
167

Nicolas2 · Answer 2 · 2018-05-30T09:20:30.193

I eventually went to the code of fuzzy_join and found a way to make what I want even without proper documentation. fuzzy_let_join doesn't work but there is the following way (not really pretty and it actually does a cartesian product):

g <- function(x,y) (x>y[,"start"])&(x<y[,"end"])
fuzzy_join(a,b, multi_by = list(x="x",y=c("start","end"))
              , multi_match_fun = g, mode = "left") %>% select(x,y)

Prem · Answer 3 · 2018-06-01T07:48:44.960

1

data.table approach could be

library(data.table)

name1 <- setdiff(names(setDT(b)), names(setDT(a))) 
#perform left outer join and then select required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]

which gives

   x    y
1: 1    a
2: 3 <NA>
3: 5    b

Sample data:

a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))

Update: In case you want to join both dataframes on (x=start+1)|(x=end+1) condition then you can try

library(data.table)

DT1 <- as.data.table(a)
DT2 <- as.data.table(b)

#Perform 1st join on "x = start+1" and then another on "x = end+1". Finally row-bind both results.
DT <- rbindlist(list(DT1[DT2[, start_temp := start+1], on = c(x = "start_temp"), .(x, y), nomatch = 0], 
                     DT1[DT2[, end_temp := end+1], on = c(x = "end_temp"), .(x, y), nomatch = 0]))
DT
#   x y
#1: 1 a
#2: 5 b
#3: 3 a

edited Jun 01 '18 at 07:48

answered May 29 '18 at 13:27

Prem

11,775
1
19
33

This works fine on my example, but I didn't manage to use my condition as I gave it (in one piece), nor to use something like (x+y)>start. What would be the syntax? – Nicolas2 May 30 '18 at 08:05
`on` criteria mentioned in my answer is "(`x`>`start`) & (`x`<`end`)". So I am really not sure if I understood your followup question. Could you please share your new input data & desired output? – Prem May 30 '18 at 08:52
> a[b, (name1) := mget(name1), on = .((x > start)&(x < end))][, .(x, y)] Error in `[.data.table`(a, b, `:=`((name1), mget(name1)), on = .((x > : Column(s) [(x] not found in x – Nicolas2 May 30 '18 at 09:18
Run the code provided in my answer as is. `on = .(x > start, x < end)` means (x > start)&(x < end) so you don't need to explicitly mention the way you have done in your previous comment. In case my code throws error then share the reproducible example along with error log for the same. – Prem May 30 '18 at 13:08
In my first question, there were the need to deal with more complicated example than just being between two values and other connectors than the 'and' (even if it seems not meaning full). – Nicolas2 May 30 '18 at 13:45
Yes, it works. Somebody suggested me something like that for the second specific case : just do the two sub requests and merge them. But my question wasn't apparently clear enough. What I was searching was some general answer that could allow for example to give the actual condition as a simple parameter, like in dplyr::filter. it is more a question of meta programming than a question of solving a specific case of joining two specific data set on a specific condition. Should I have to program a solution by myself? – Nicolas2 Jun 01 '18 at 08:11
I think to generalize it you may look into an option to write a function. One of the approach could be to identify type of condition and accordingly pass your function's flow to execute the type of 'Join'. You may also want to post a new question by rephrasing it accordingly. – Prem Jun 01 '18 at 08:25

score 1 · Answer 4 · answered Jun 01 '18 at 09:10

A possible answer to explain what I am trying to do : extending dplyr in some way. And I will be happy to know if there are ways to improve this solution or some problems I didn't see. The solution avoids the cartesian product, but duplicates into lists of data frames both one of the input data frame and the result. I didn't include the final column selection of x and y that is easy to code.

my_left_join <- function(.DATA1,.DATA2,.WHERE)
  {
  call = as.list(match.call())
  df1 <- .DATA1
  df1$._row_ <- 1:nrow(df1)
  dfl1 <- replyr::replyr_split(df1,"._row_")
  eval(substitute(
    dfl2 <- mapply(function(.x) 
                  {filter(.DATA2,with(.x,WHERE)) %>%
                   mutate(._row_=.x$._row_)}
                  , dfl1, SIMPLIFY=FALSE)
    ,list(WHERE=call$.WHERE))) 
  df2 <- replyr::replyr_bind_rows(dfl2)
  left_join(df1,df2,by="._row_") %>% select(-._row_)
  }

my_left_join(a,b,(x>start)&(x<end))
#  x start end    y
#1 1     0   2    a
#2 3    NA  NA <NA>
#3 5     4   6    b

my_left_join(a,b,(x==(start+1))|(x==(end+1)))
#  x start end y
#1 1     0   2 a
#2 3     0   2 a
#3 5     4   6 b

score 0 · Answer 5 · answered May 29 '18 at 11:48

You can try a GenomicRanges solution

library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
a
  x    y
1 1    a
2 3 <NA>
3 5    b

How to make a fuzzy join in R using more than one variable on each side

5 Answers5