How to perform a fuzzy join with fuzzyjoin::difference_* in R

Question

I'm working with two different datasets that I want to merge based on a threshold. Let's say the two dataframes look like this:

library(dplyr)
library(fuzzyjoin)
library(lubridate)

df1 = data_frame(Item=1:5, 
                 DateTime=c("2015-01-01 11:12:14", "2015-01-02 09:15:23", 
                            "2015-01-02 15:46:11", "2015-04-19 22:11:33", 
                            "2015-06-10 07:00:00"), 
                 Count=c(1, 6, 11, 15, 9), 
                 Name="Sterling", 
                 Friend=c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))
df1$DateTime = ymd_hms(df1$DateTime)

df2 = data_frame(Item=21:25, 
                 DateTime=c("2015-01-01 11:12:15", "2015-01-02 19:15:23", 
                            "2015-01-02 15:46:11", "2015-05-19 22:11:33", 
                            "2015-06-10 07:00:02"), 
                 Count=c(3, 7, 11, 15, 8), 
                 Name="Sterling", 
                 Friend=c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))
df2$DateTime = ymd_hms(df2$DateTime)

What I would like now, is to be able to left join df2 with df1 based on a fuzzy match of DateTime and Count being within two seconds of their respective values, while all other values except Item are identical. I thought I could get there with the following:

df1 %>%
  difference_left_join(df2, by=c("DateTime", "Count"), max_dist=2)

But that gives me the following output:

 # A tibble: 8 × 10
  Item.x          DateTime.x Count.x   Name.x Friend.x Item.y          DateTime.y Count.y   Name.y  Friend.y
   <int>              <dttm>   <dbl>    <chr>    <chr>  <int>              <dttm>   <dbl>    <chr>     <chr>
1      1 2015-01-01 11:12:14       1 Sterling      Pam     21 2015-01-01 11:12:15       3 Sterling       Pam
2      1 2015-01-01 11:12:14       1 Sterling      Pam     21 2015-01-01 11:12:15       3 Sterling       Pam
3      2 2015-01-02 09:15:23       6 Sterling    Cyril     NA                <NA>      NA     <NA>      <NA>
4      3 2015-01-02 15:46:11      11 Sterling   Cheryl     23 2015-01-02 15:46:11      11 Sterling Woodhouse
5      3 2015-01-02 15:46:11      11 Sterling   Cheryl     23 2015-01-02 15:46:11      11 Sterling Woodhouse
6      4 2015-04-19 22:11:33      15 Sterling  Mallory     NA                <NA>      NA     <NA>      <NA>
7      5 2015-06-10 07:00:00       9 Sterling     Lana     25 2015-06-10 07:00:02       8 Sterling      Lana
8      5 2015-06-10 07:00:00       9 Sterling     Lana     25 2015-06-10 07:00:02       8 Sterling      Lana

This is close, except that row 3 should not have merged given that the names are different (and I would have expected row 2 to merge given the thresholds, even though I don't want it to).

How do I end up with the following dataframe? Note that the second row and third row from df2 was not merged despite DateTime and Count meeting the threshold limit. This is because the other columns (except Item) were not identical.

desired_output
#   Item            DateTime Count     Name  Friend
# 1    3 2015-01-02 15:46:11    11 Sterling  Cheryl
# 2    2 2015-01-02 09:15:23     6 Sterling   Cyril
# 3    5 2015-06-10 07:00:00     9 Sterling    Lana
# 4   25 2015-06-10 07:00:02     8 Sterling    Lana
# 5    4 2015-04-19 22:11:33    15 Sterling Mallory
# 6    1 2015-01-01 11:12:14     1 Sterling     Pam
# 7   21 2015-01-01 11:12:15     3 Sterling     Pam

I think it gives you the error you mentioned if you don't use `lubridate` on the data before the join. Otherwise if gives you a different message. BTW sorry if I'm missing something obvious but where does "Duration" come from? — Hack-R, Sep 22 '16 at 17:00
Ah, sorry! I'm dummied up the data to mask what it really is, but then kept using real variable names. I'll clean that up. — tblznbits, Sep 22 '16 at 17:03
Also removed reference to the error given that you pointed out why it happened, and updated with the result of calling that line of code. — tblznbits, Sep 22 '16 at 17:10

Hack-R · Accepted Answer · 2016-09-22T19:37:36.333

OK, so, the message you got was because the fuzzy match can't be calculated on a non-numeric column.

The thing to do is to convert it to numeric. Since your caliper is in seconds I converted it to seconds and then made them numeric:

library(dplyr)
library(fuzzyjoin)
library(lubridate)

df1 = data_frame(Item=1:5, 
                 DateTime=c("2015-01-01 11:12:14", "2015-01-02 09:15:23", 
                            "2015-01-02 15:46:11", "2015-04-19 22:11:33", 
                            "2015-06-10 07:00:00"), 
                 Count=c(1, 6, 11, 15, 9), 
                 Name="Sterling", 
                 Friend=c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))
df1$DateTime1 = as.numeric(seconds(ymd_hms(df1$DateTime)))

df2 = data_frame(Item=21:25, 
                 DateTime=c("2015-01-01 11:12:15", "2015-01-02 19:25:56", 
                            "2015-01-02 15:46:11", "2015-05-19 22:11:33", 
                            "2015-06-10 07:00:02"), 
                 Count=c(3, 6, 11, 15, 8), 
                 Name="Sterling", 
                 Friend=c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))
df2$DateTime1 = as.numeric(seconds(ymd_hms(df2$DateTime)))

df1 %>%
  difference_left_join(y=df2, by=c("DateTime1", "Count"), max_dist=2)

based on our discussion in the comments a simple tweak to subset this to the case of other character columns matching would be:

df1[df2$Friend == df1$Friend,] %>%
  difference_left_join(y=df2[df2$Friend == df1$Friend,], by=c("DateTime1", "Count"), max_dist=2)

That example is just for Friend but of course you could use & to do it with multiple columns.

Is it possible to specify that all character variables must match exactly? I suppose it would be easier to just take the output from `difference_left_join` and then filter out the records where, for example, `Name.x != Name.y`. — tblznbits, Sep 22 '16 at 17:19
@brittenb I thought you only wanted them to match within 2 seconds? — Hack-R, Sep 22 '16 at 17:33
Admittedly, after re-reading my question and the title, I wasn't very clear. I do want them to match within 2 seconds of each other, but only when every other column (except `Item`) matches. — tblznbits, Sep 22 '16 at 19:23
@brittenb Oh, ok I see. Yea I can do that as soon as I get a free second. — Hack-R, Sep 22 '16 at 19:35

How to perform a fuzzy join with fuzzyjoin::difference_* in R

1 Answers1