fuzzyjoin two data frames using data.table

Question

I have been using fuzzyjoin to join two data frames, but because of memory issues the join fails with "cannot allocate memory of…". So I am trying to do the same join with data.table. A sample of the data is below.

df1 looks like:

        ID     f_date               ACCNUM    flmNUM start_date   end_date
1    50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30
2  1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19
3   804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05
4  1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13
5  1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20
6   858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24
7     2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17
8  1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03
9  1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10
10   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20

df2 looks like:

     ID       date fyear     at     lt
1 50341 1998-12-31  1998 104382  94973
2 50341 1999-12-31  1999 190692 175385
3 50341 2000-12-31  2000 179519 163347
4 50341 2001-12-31  2001 203638 186030
5 50341 2002-12-31  2002 190453 173620
6 50341 2003-12-31  2003 200235 181955

I will focus on ID = 50341. If df2$date falls within the period between df1$start_date and df1$end_date, the rows should be joined. Here df2$date = 2002-12-31 lies between the df1 start 2002-09-07 and end 2003-08-30, so this row is joined.
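The matching rule can be checked by hand for this one ID. The snippet below is only a sketch in base R: the dates are copied from the two samples above, and the variable names (`dates`, `in_window`) are made up for illustration.

```r
# Window for ID 50341, taken from the df1 sample
start_date <- as.Date("2002-09-07")
end_date   <- as.Date("2003-08-30")

# The six df2 dates for ID 50341
dates <- as.Date(c("1998-12-31", "1999-12-31", "2000-12-31",
                   "2001-12-31", "2002-12-31", "2003-12-31"))

# Same rule as the join: strictly after start_date, on or before end_date
in_window <- dates > start_date & dates <= end_date
dates[in_window]
```

Only one of the six dates satisfies both inequalities, which is the row the join is expected to pick up.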

I run the following code and get the corresponding output:

df1$f_date <- as.Date(df1$f_date)
df2$date <- as.Date(df2$date)

df1$start_date <- df1$f_date + 183
df1$end_date <- df1$f_date + 540

library(fuzzyjoin)
final_data <- fuzzy_left_join(
  df1, df2,
  by = c(
    "ID" = "ID",
    "start_date" = "date",
    "end_date" = "date"
  ),
  match_fun = list(`==`, `<`, `>=`)
)

final_data

Output:

      ID.x     f_date               ACCNUM    flmNUM start_date   end_date    ID.y       date fyear         at         lt
1    50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30   50341 2002-12-31  2002 190453.000 173620.000
2  1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19 1067983 2010-12-31  2010 372229.000 209295.000
3   804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05  804753 2004-12-31  2004    982.265    383.614
4  1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13 1090727 2013-12-31  2013  36212.000  29724.000
5  1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20 1467858 2010-12-31  2010 138898.000 101739.000
6   858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24      NA       <NA>    NA         NA         NA
7     2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17    2488 2016-12-31  2016   3321.000   2905.000
8  1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03      NA       <NA>    NA         NA         NA
9  1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10 1467858 2017-12-31  2017 212482.000 176282.000
10   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20   14693 2016-04-30  2015   4183.000   2621.000

Here we can see that ID= 50341 is joined up correctly.

When I try to run the data.table way I get this output:

Code:

dt_final_data <- setDT(df2)[df1, on = .(ID, date > start_date, date <= end_date)]
dt_final_data

Output:

         ID       date fyear         at         lt     date.1     f_date               ACCNUM    flmNUM
 1:   50341 2002-09-07  2002 190453.000 173620.000 2003-08-30 2002-03-08 0001104659-02-000656   2571187
 2: 1067983 2010-05-27  2010 372229.000 209295.000 2011-05-19 2009-11-25 0001047469-09-010426  91207220
 3:  804753 2004-11-13  2004    982.265    383.614 2005-11-05 2004-05-14 0001193125-04-088404   4805453
 4: 1090727 2013-11-21  2013  36212.000  29724.000 2014-11-13 2013-05-22 0000712515-13-000022  13865105
 5: 1467858 2010-08-28  2010 138898.000 101739.000 2011-08-20 2010-02-26 0001193125-10-043035  10640035
 6:  858877 2019-08-02    NA         NA         NA 2020-07-24 2019-01-31 0001166691-19-000005  19556540
 7:    2488 2016-08-25  2016   3321.000   2905.000 2017-08-17 2016-02-24 0001193125-16-476010 161452982
 8: 1478242 2004-09-11    NA         NA         NA 2005-09-03 2004-03-12 0001193125-04-039482   4664082
 9: 1467858 2017-08-18  2017 212482.000 176282.000 2018-08-10 2017-02-16 0001555280-17-000044  17618235
10:   14693 2016-04-28  2015   4183.000   2621.000 2017-04-20 2015-10-28 0001193125-15-356351 151180619

Here start_date in df1 has now become date and end_date in df1 has become date.1. Therefore my original date column in df2 has disappeared which is one of the more important dates for checking if the merge worked as it should.

Two questions:

How can I keep all the date columns as in the fuzzyjoin example? The way data.table has changed the names makes it a little confusing when I am checking the join.

Is the code/logic correct? I have looked at this joined data a number of times and it "appears" correct.

Data1:

df1 <- 
    structure(list(ID = c(50341L, 1067983L, 804753L, 1090727L, 1467858L, 
858877L, 2488L, 1478242L, 1467858L, 14693L), f_date = structure(c(11754, 
14573, 12552, 15847, 14666, 17927, 16855, 12489, 17213, 16736
), class = "Date"), ACCNUM = c("0001104659-02-000656", "0001047469-09-010426", 
"0001193125-04-088404", "0000712515-13-000022", "0001193125-10-043035", 
"0001166691-19-000005", "0001193125-16-476010", "0001193125-04-039482", 
"0001555280-17-000044", "0001193125-15-356351"), flmNUM = c(2571187L, 
91207220L, 4805453L, 13865105L, 10640035L, 19556540L, 161452982L, 
4664082L, 17618235L, 151180619L), 
start_date = structure(c(11937, 14756, 12735, 16030, 14849, 18110, 17038, 
                         12672, 17396, 16919), class = "Date"), 
end_date = structure(c(12294, 15113, 13092, 16387, 15206, 18467, 17395, 13029,
                       17753, 17276), class = "Date")
), row.names = c(NA, -10L), class = "data.frame")

Data2:

df2 <-
    structure(list(ID = c(2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 14693L, 14693L, 
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 
14693L, 14693L, 14693L, 50341L, 50341L, 50341L, 50341L, 50341L, 
50341L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L, 
1478242L, 1478242L, 1478242L, 1478242L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L), date = structure(c(10591, 
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878, 
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166, 
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783, 
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070, 
16435, 16800, 17166, 17531, 17896, 10346, 10711, 11077, 11442, 
11807, 12172, 12538, 12903, 13268, 13633, 13999, 14364, 14729, 
15094, 15460, 15825, 16190, 16555, 16921, 17286, 17651, 10591, 
10956, 11322, 11687, 12052, 12417, 10591, 10956, 11322, 11687, 
12052, 12417, 12783, 13148, 13513, 13878, 14244, 14609, 14974, 
15339, 15705, 16070, 16435, 16800, 17166, 17531, 17896, 10591, 
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878, 
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166, 
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783, 
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070, 
16435, 16800, 17166, 17531, 17896, 14609, 14974, 15339, 15705, 
16070, 16435, 16800, 17166, 17531, 17896, 10438, 10803, 11169, 
11534, 11899, 12264, 12630, 12995, 13360, 13725, 14091, 14456, 
14821, 15186, 15552, 15917, 16282, 16647, 17013, 17378, 17743
), class = "Date"), fyear = c(1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 
2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 
2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 2018L), at = c(4252.968, 4377.698, 
5767.735, 5647.242, 5619.181, 7094.345, 7844.21, 7287.779, 13147, 
11550, 7675, 9078, 4964, 4954, 4000, 4337, 3767, 3109, 3321, 
3540, 4556, 122237, 131416, 135792, 162752, 169544, 180559, 188874, 
198325, 248437, 273160, 267399, 297119, 372229, 392647, 427452, 
484931, 526186, 552257, 620854, 702095, 707794, 1494, 1735, 1802, 
1939, 2016, 2264, 2376, 2624, 2728, 3551, 3405, 3475, 3383, 3712, 
3477, 3626, 4103, 4193, 4183, 4625, 4976, 104382, 190692, 179519, 
203638, 190453, 200235, 257389, 274730, 303100, 323969, 370782, 
448507, 479921, 476078, 186192, 148883, 91047, 136295, 138898, 
144603, 149422, 166344, 177677, 194520, 221690, 212482, 227339, 
17067, 23043, 21662, 24636, 26357, 28909, 33026, 35222, 33210, 
39042, 31879, 31883, 33597, 34701, 38863, 36212, 35471, 38311, 
40377, 45403, 50016, 436.485, 660.891, 616.411, 712.302, 779.279, 
859.34, 982.265, 1303.629, 1491.39, 1689.956, 1880.988, 2148.567, 
2422.79, 3000.358, 3704.468, 4098.364, 4530.565, 5561.984, 5629.963, 
6469.311, 6708.636, NA, NA, 2322.917, 2499.153, 3066.797, 3305.832, 
3926.316, 21208, 22742, 22549, 8916.705, 14725, 32870, 35238, 
37795, 37107, 35594, 33883, 43315, 53340, 58734, 68128, 81130, 
87095, 91759, 101191, 105134, 113481, 121652, 129818, 108784), 
    lt = c(2247.919, 2398.425, 2596.068, 2092.187, 3151.916, 
    3938.395, 3993.516, 3700.954, 7072, 8295, 7588, 7354, 3951, 
    3364, 3462, 3793, 3580, 3521, 2905, 2929, 3290, 63190, 72232, 
    72799, 103453, 104116, 102218, 102216, 106025, 137756, 149759, 
    153820, 161334, 209295, 223686, 235864, 260446, 283159, 293630, 
    334495, 350141, 355294, 677, 818, 754, 752, 705, 1424, 1291, 
    1314, 1165, 1978, 1680, 1659, 1488, 1652, 1408, 1998, 2071, 
    2288, 2621, 3255, 3660, 94973, 175385, 163347, 186030, 173620, 
    181955, 241738, 253490, 272218, 303516, 363134, 422932, 452164, 
    460442, 190443, 184363, 176387, 107340, 101739, 105612, 112422, 
    123170, 141653, 154197, 177615, 176282, 184562, 9894, 10569, 
    11927, 14388, 13902, 14057, 16642, 18338, 17728, 26859, 25099, 
    24187, 25550, 27593, 34130, 29724, 33313, 35820, 39948, 44373, 
    46979, 165.342, 281.954, 272.694, 317.463, 338.035, 363.494, 
    383.614, 541.81, 571.972, 556.242, 568.693, 567.769, 517.373, 
    689.557, 870.818, 930.7, 964.597, 1691.6, 1702.016, 1683.963, 
    1780.247, NA, NA, 3292.513, 3858.197, 3734.282, 4009.844, 
    4261.997, 12348, 14384, 15595, 1766.98, 3003, 6328, 8096, 
    9124, 9068, 9678, 10699, 19397, 21850, 24332, 29451, 36845, 
    39836, 40458, 42063, 48473, 53774, 58067, 63681, 65580)), row.names = c(NA, 
-163L), class = "data.frame")

You can duplicate the columns before joining on them: `setDT(df2)[, d := date][setDT(df1)[, c("sdt", "edt") := .(start_date, end_date)], on = .(ID, d > sdt, d <= edt)]` — chinsoon12, Apr 08 '19 at 01:10
You can use `fuzzyjoin::interval_left_join`, it uses package `IRanges` and doesn't do a cartesian product — moodymudskipper, Apr 08 '19 at 18:02
Thanks! Do you think this will solve the memory issues? The `data.table` method seems very efficient, but it modifies/removes some of the columns. @chinsoon's suggestion seems to correct this; I just needed to rename/remove one or two columns afterwards to clean the data a little. — user8959427, Apr 08 '19 at 19:10
I believe *data.table* sorts the data and does a binary search for every inequality condition, maybe with `IRanges` this is done more efficiently by `interval_join` but I only used it on small data and just to test it so I can't say. — moodymudskipper, Apr 09 '19 at 15:22
@Moody_Mudskipper interesting comment on the timings but is that comparing recent versions? Just asking as the current 1.12 version of data.table performs extremely well e.g. [1.22 seconds for 1.4M records in this question](https://stackoverflow.com/a/55088083/10312356) using non-equi joins on dates suggesting it's pretty well optimized. However, I have not used IRanges to know which is faster. If you had time it would be really interesting if you were able to add an IRanges solution to that other question for comparison - certainly worth an upvote there if you did. — krads, Apr 13 '19 at 03:16

krads · Accepted Answer · 2019-04-13T14:06:39.397

To clarify terminology:

The data.table approach to your problem does not require a fuzzy join (at least not in the sense of inexact matching). Instead, you want to join on data.table columns using the inequality operators >=, >, <= and/or <. In data.table terminology these are called "non-equi joins".

You titled your question "fuzzyjoin two data frames using data.table", understandably, because you used library(fuzzyjoin) in your first working attempt. (No problem, just clarifying for readers.)

Solution using `data.table` non-equi joins to compare date columns:

You were very close to a working data.table solution where you had:

dt_final_data <- setDT(df2)[df1, 
                            on = .(ID, date > start_date, date <= end_date)]

To make it work as you want, add a data.table j expression that selects the columns you want, in the order you want them. EDIT: also prefix the problem column with x. to tell data.table to return that column from the x side of the dt_x[dt_i, ] join. For example, the call below refers to the column as x.date:

dt_final_data <- setDT(df2)[df1, 
                            .(ID, f_date, ACCNUM, flmNUM, start_date, end_date, x.date, fyear, at, lt), 
                            on = .(ID, date > start_date, date <= end_date)]

This now gives you the output you are after:

dt_final_data
         ID     f_date               ACCNUM    flmNUM start_date   end_date     x.date fyear         at         lt
 1:   50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30 2002-12-31  2002 190453.000 173620.000
 2: 1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19 2010-12-31  2010 372229.000 209295.000
 3:  804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05 2004-12-31  2004    982.265    383.614
 4: 1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13 2013-12-31  2013  36212.000  29724.000
 5: 1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20 2010-12-31  2010 138898.000 101739.000
 6:  858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24       <NA>    NA         NA         NA
 7:    2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17 2016-12-31  2016   3321.000   2905.000
 8: 1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03       <NA>    NA         NA         NA
 9: 1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10 2017-12-31  2017 212482.000 176282.000
10:   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20 2016-04-30  2015   4183.000   2621.000

As shown above, your result for ID = 50341 now has x.date = 2002-12-31. In other words, the date in the result now comes from df2$date rather than from df1$start_date.
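To convince yourself that the logic is right, it can help to run the same pattern on a tiny table where the answer is obvious. The following is a sketch only: `accounts`, `filings`, `value` and `matched_date` are made-up names standing in for df2, df1 and their columns, not part of your data.

```r
library(data.table)

# Toy stand-in for df2: one dated observation per row
accounts <- data.table(
  ID    = c(1L, 1L, 2L),
  date  = as.Date(c("2020-03-31", "2020-12-31", "2020-06-30")),
  value = c(10, 20, 30)
)

# Toy stand-in for df1: one window per row; ID 3 has no rows in accounts
filings <- data.table(
  ID         = c(1L, 2L, 3L),
  start_date = as.Date(c("2020-06-01", "2020-01-01", "2020-01-01")),
  end_date   = as.Date(c("2021-05-31", "2020-12-31", "2020-12-31"))
)

# Non-equi join; x.date asks for the date column from the x side (accounts)
res <- accounts[filings,
                .(ID, start_date, end_date, matched_date = x.date, value),
                on = .(ID, date > start_date, date <= end_date)]
res

# Every matched date should lie inside its own window
ok <- res[!is.na(matched_date),
          all(matched_date > start_date & matched_date <= end_date)]
ok
```

The final check is the same test you would run on dt_final_data: drop the unmatched rows, then confirm that each x.date is after start_date and no later than end_date.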

You can of course rename the x.date column in your j expression:

setDT(df2)[ df1, 
            .(ID, 
              f_date, 
              ACCNUM, 
              flmNUM, 
              start_date, 
              end_date, 
              my_result_date_name = x.date, 
              fyear, 
              at, 
              lt), 
            on = .(ID, date > start_date, date <= end_date)]

Why does data.table (currently) rename columns in non-equi joins and return data from a different column:

This explanation from @ScottRitchie sums it up quite nicely:

When performing any join, only one copy of each key column is returned in the result. Currently, the column from i is returned, and labelled with the column name from x, making equi joins consistent with the behaviour of base merge().

This makes sense if you keep in mind that data.table had no non-equi joins before version 1.9.8.
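The effect is easy to reproduce on a two-row table. This is a sketch with made-up data (`x`, `i` and `value` are illustrative names), and it assumes a data.table version that still behaves as described here:

```r
library(data.table)

x <- data.table(ID    = 1L,
                date  = as.Date(c("2020-03-31", "2020-12-31")),
                value = c(10, 20))
i <- data.table(ID         = 1L,
                start_date = as.Date("2020-06-01"),
                end_date   = as.Date("2021-05-31"))

# No j expression: data.table chooses the returned columns itself
default_res <- x[i, on = .(ID, date > start_date, date <= end_date)]
names(default_res)
default_res$date   # carries i's start_date under x's column name

# Requesting x.date explicitly returns the matched value from x instead
fixed_res <- x[i,
               .(ID, start_date, end_date, date = x.date, value),
               on = .(ID, date > start_date, date <= end_date)]
fixed_res$date
```

In the first result the column called date is really the window start, which is why the original df2 date appeared to vanish in the question.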

Up to and including the current 1.12.2 release of data.table, this (and several overlapping issues) has been the source of a lot of discussion on the data.table GitHub issues list. For example: possible inconsistency in non-equi join, returning join columns #3437 and SQL-like column return for non-equi and rolling joins #2706 are just 2 of many.

However, watch this GitHub issue: continuing from the above discussions, the keen analytical minds of the data.table team are working to make this less confusing in some (hopefully not too distant) future version: Both columns for rolling and non-equi joins #3093

Thanks! However here the problem still persists, the `date` column gets overwritten by the `start_date` and they both have the same date. Running the following `df2 %>% filter(ID == "50341")` Gives a small output for the first row in your output. There is no `date = 2002-09-07` in this output. I should have instead in your output `date = 2002-12-31`. — user8959427, Apr 12 '19 at 16:32
OK I see what you mean now. That just requires the column to be specified as `x.date` in this case to tell data.table to pick the column from x side of the join (as in `x[ i,]`) I'll amend my answer as soon as I have time. — krads, Apr 13 '19 at 00:27
Thanks, this was very helpful. However, I need it to be a full join, and now I get a left join. How do I resolve this? — Wilkit, Sep 02 '20 at 14:21
I've solved my earlier question by a simple merge between the new file and df2. But I have another question: if I (sometimes) have multiple start_date and end_date ranges in df1 that match date in df2, it produces duplicates of df1. I need it to match the ones with the closest start_date and date match, although I may have multiple similar start_dates. It seems impossible to me. Any suggestions for working around this issue? — Wilkit, Sep 02 '20 at 14:39
