38

This question is somewhat related to issues Efficiently merging two data frames on a non-trivial criteria and Checking if date is between two dates in r. And the one I have posted here requesting if the feature exist: GitHub issue

I am looking to join two dataframes using dplyr::left_join(). The condition I use to join is less-than, greater-than i.e, <= and >. Does dplyr::left_join() support this feature? or do the keys only take = operator between them. This is straightforward to run from SQL (assuming I have the dataframe in the database)

Here is a MWE: I have two datasets one firm-year (fdata), while second is sort of survey data that happens once every five years. So for all years in the fdata that are in between two survey years, I join the corresponding survey year data.

id <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,
        5,5,5,5,
        8,8,8,8,
        13,13,13)

fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
       1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
       1998,1999,2000,2001,1998,1999,2000)

byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)

sdata <- tbl_df(data.frame(byear, eyear, val))

fdata <- tbl_df(data.frame(id, fyear))

test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))

I get

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds 

Unless if left_join can handle the condition, but my syntax is missing something?

M--
  • 25,431
  • 8
  • 61
  • 93
rajvijay
  • 1,641
  • 4
  • 23
  • 28
  • You may wish to subscribe to https://github.com/tidyverse/dplyr/issues/2240 – Arthur Yip Jun 06 '17 at 03:19
  • 1
    I'm happy to see the new version of dplyr will support non-equi joins, see new answer below for example of the syntax, very similar to what you originally wanted: `left_join(fdata, sdata, join_by(fyear >= byear,fyear < eyear))` – Jon Spring Aug 25 '22 at 18:40

5 Answers5

29

data.table adds non-equi joins starting from v 1.9.8

library(data.table) #v>=1.9.8
setDT(sdata); setDT(fdata) # converting to data.table in place

fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,
      .(id, x.fyear, byear, eyear, val)]
#    id x.fyear byear eyear val
# 1:  1    1998  1995  2000   1
# 2:  2    1998  1995  2000   1
# 3:  3    1998  1995  2000   1
# 4:  5    1998  1995  2000   1
# 5:  8    1998  1995  2000   1
# 6: 13    1998  1995  2000   1
# 7:  1    1999  1995  2000   1
# 8:  2    1999  1995  2000   1
# 9:  3    1999  1995  2000   1
#10:  5    1999  1995  2000   1
#11:  8    1999  1995  2000   1
#12: 13    1999  1995  2000   1
#13:  1    2000  2000  2005   5
#14:  2    2000  2000  2005   5
#15:  3    2000  2000  2005   5
#16:  5    2000  2000  2005   5
#17:  8    2000  2000  2005   5
#18: 13    2000  2000  2005   5
#19:  1    2001  2000  2005   5
#20:  2    2001  2000  2005   5
#21:  3    2001  2000  2005   5
#22:  5    2001  2000  2005   5
#23:  8    2001  2000  2005   5
#24:  2    2002  2000  2005   5
#25:  3    2002  2000  2005   5
#26:  2    2003  2000  2005   5
#27:  3    2003  2000  2005   5
#    id x.fyear byear eyear val

You can also get this to work with foverlaps in 1.9.6 with a little more effort.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
eddi
  • 49,088
  • 6
  • 104
  • 155
  • 2
    `setDF` can be used afterwards if somebody wants to returns his datasets to plain data.frame – jangorecki May 24 '16 at 18:58
  • @eddi After the join, in fetching columns is there a data.table equivalent of .(i.*, x.fear) i.e. all columns from table i but only fear from table x Thank you. – Sweepy Dodo Jul 03 '19 at 13:58
  • This solution is cleaner and faster than the `tidyr`/`dplyr` one, and works when adding more conditions. – Nakx Apr 28 '20 at 03:07
  • 1
    The fyear >= byear, fyear < eyear looks great and appears to imply and AND operator. My code is not correctly selecting the records within each period. I've searched the docs, have seen a reference to a mysterious vignette about join and rolling join that I can't find. Could you point me to the right place to understand conditional element of the join? – Chris Nov 15 '20 at 17:58
  • Turns out I need to add the equivalent of .(fyear = x. fyear,.... after the on element, otherwise fyear seems to get replaced by byear (as per https://stackoverflow.com/questions/41043047/complexe-non-equi-merge-in-r). Not sure why but it works now. If there was a explanation of this in a vignette, that would be great. – Chris Nov 15 '20 at 18:29
29

The original answer below is out of date, as pointed out in another answer. With newer versions of dplyr, simply use the following. (Note that this syntax works even with database backends using dbplyr.)

fdata %>% 
left_join(sdata,
          join_by(fyear >= byear, fyear < eyear))

When the original answer was created, there was no easy way to do inequality joins using dplyr.

Original answer

Use a filter. (But note that this answer does not produce a correct LEFT JOIN; but the MWE gives the right result with an INNER JOIN instead.)

The dplyr package isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy:

fdata %>% 
    mutate(dummy=TRUE) %>%
    left_join(sdata %>% mutate(dummy=TRUE)) %>%
    filter(fyear >= byear, fyear < eyear) %>%
    select(-dummy)

And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable as evidenced by the following two query explanations:

> fdata %>% 
+     mutate(dummy=TRUE) %>%
+     left_join(sdata %>% mutate(dummy=TRUE)) %>%
+     filter(fyear >= byear, fyear < eyear) %>%
+     select(-dummy) %>%
+     explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"

LEFT JOIN 

(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"

USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"


<PLAN>
Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

and doing it more cleanly with SQL gives exactly the same result:

> tbl(pg, sql("
+     SELECT *
+     FROM fdata 
+     LEFT JOIN sdata 
+     ON fyear >= byear AND fyear < eyear")) %>%
+     explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
    SELECT *
    FROM fdata 
    LEFT JOIN sdata 
    ON fyear >= byear AND fyear < eyear) AS "zzz140"


<PLAN>
Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)
Ian Gow
  • 3,098
  • 1
  • 25
  • 31
  • 1
    As noted, this does not produce a proper left_join because it drops rows from the left dataset without matches within the filter on the right. – Patrick Sep 18 '19 at 20:09
23

This looks like it is the sort of task that package fuzzyjoin addresses. The various functions of the package look and work similar to the dplyr join functions.

In this case one of the fuzzy_*_join functions will work for you. The main difference between dplyr::left_join and fuzzyjoin::fuzzy_left_join is that you give a list of functions to use in the matching process with the match.fun argument. Note the by argument still is written the same as it would in left_join.

Below is an example. The functions I used to match on are >= and < for the fyear to byear and the fyear to eyear comparisons, respectively. The

library(fuzzyjoin)

fuzzy_left_join(fdata, sdata, 
             by = c("fyear" = "byear", "fyear" = "eyear"), 
             match_fun = list(`>=`, `<`))

Source: local data frame [27 x 5]

      id fyear byear eyear   val
   (dbl) (dbl) (dbl) (dbl) (dbl)
1      1  1998  1995  2000     1
2      1  1999  1995  2000     1
3      1  2000  2000  2005     5
4      1  2001  2000  2005     5
5      2  1998  1995  2000     1
6      2  1999  1995  2000     1
7      2  2000  2000  2005     5
8      2  2001  2000  2005     5
9      2  2002  2000  2005     5
10     2  2003  2000  2005     5
..   ...   ...   ...   ...   ...
aosmith
  • 34,856
  • 9
  • 84
  • 118
  • 1
    This does not scale well, but is a nice and easy to understand solution – wolfsatthedoor Jan 26 '20 at 04:12
  • @aosmith Thanks for the solution. Would it be possible to extend the interval? i.e., `fyear >= byear-20` and `fyear < eyear+5` – Prradep Nov 26 '20 at 11:44
  • This does not seem to work for dates - it chockes very fast, compared to an Excel vlookup on the same data. – s_a Aug 22 '21 at 22:25
10

dplyr v1.1.0 now includes the ability to do non-equi joins like this, with almost exactly the syntax you tried. For data with many partial matches, this will be much more performant than using fuzzyjoin or a filter step after an over-inclusive join.

# Relies on dplyr >=1.1.0, released Jan 2023
library(dplyr)
left_join(fdata, sdata, join_by(fyear >= byear,fyear < year))
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
2

One option is to join row-wise as a list column, and then unnest the column:

# evaluate each row individually
fdata %>% 
    rowwise() %>% 
    # insert list column of single row of sdata based on conditions
    mutate(s = list(sdata %>% filter(fyear >= byear, fyear < eyear))) %>% 
    # unnest list column
    tidyr::unnest()

# Source: local data frame [27 x 5]
# 
#       id fyear byear eyear   val
#    (dbl) (dbl) (dbl) (dbl) (dbl)
# 1      1  1998  1995  2000     1
# 2      1  1999  1995  2000     1
# 3      1  2000  2000  2005     5
# 4      1  2001  2000  2005     5
# 5      2  1998  1995  2000     1
# 6      2  1999  1995  2000     1
# 7      2  2000  2000  2005     5
# 8      2  2001  2000  2005     5
# 9      2  2002  2000  2005     5
# 10     2  2003  2000  2005     5
# ..   ...   ...   ...   ...   ...
Julian
  • 6,586
  • 2
  • 9
  • 33
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • 2
    Like my answer, this does not produce a valid `LEFT JOIN`. Augment the left data frame with observations with `fyear==2011` and then filter the result of your query on `fyear==2011` and there's nothing there. This works in SQL: `SELECT * FROM fdata LEFT JOIN sdata ON fyear >= year AND fyear < eyear`. – Ian Gow May 20 '16 at 02:06