-2

Let's consider two dataframes df1 and df2. I would like to join dataframes based on the date difference only. For Example;

Dataframe 1: (df1)

| version_id | date_invoiced | product_id |
-------------------------------------------
|     1      | 03-07-2020    |   201      |
|     1      | 02-07-2020    |   2013     |
|     3      | 02-07-2020    |   2011     |
|     6      | 01-07-2020    |   2018     |
|     7      | 01-07-2020    |   201      |

Dataframe 2: (df2)

|  validfrom   | pricelist| pricelist_id |
------------------------------------------
|02-07-2020    |   10     |      101     |
|01-07-2020    |   20     |      102     |
|29-06-2020    |   30     |      103     |
|28-07-2020    |   10     |      104     |
|25-07-2020    |   5      |      105     |

I need to map the pricelist_id and the pricelist based on the the validfrom column present in df2. Say that, based on the least difference between the date_invoiced (df1) and validfrom (df2), the row should be mapped.

Expected Outcome:

| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
----------------------------------------------------------------------------------
|     1      | 03-07-2020    |   201      |     1     |     101      |     10    |
|     1      | 02-07-2020    |   2013     |     1     |     102      |     20    |
|     3      | 02-07-2020    |   2011     |     1     |     102      |     20    |
|     6      | 01-07-2020    |   2018     |     1     |     103      |     30    |
|     7      | 01-07-2020    |   201      |     1     |     103      |     30    |

I need to map purely based on the difference and the difference should be the least. Always, the date_invoiced (df1), should have closest difference comparing to validfrom (df2). Thanks

JBH
  • 101
  • 6
  • Maybe the package `fuzzyjoin` could help you in this case. See this question: https://stackoverflow.com/questions/58718287/fuzzyjoin-with-dates-in-r – Nico Jul 03 '20 at 07:49
  • Shared reproducible examples – JBH Jul 03 '20 at 10:49

1 Answers1

1

Perhaps you might want to try using date.table and nearest roll. Here, the join is made on DATE which would be DATEINVOICED from df1 and VALIDFROM in df2.

library(data.table)

setDT(df1)
setDT(df2)

df1$DATEINVOICED <- as.Date(df1$DATEINVOICED, format = "%d-%m-%y")
df2$VALIDFROM <- as.Date(df2$VALIDFROM, format = "%d-%m-%y")

setkey(df1, DATEINVOICED)[, DATE := DATEINVOICED]
setkey(df2, VALIDFROM)[, DATE := VALIDFROM]

df2[df1, on = "DATE", roll='nearest']
Ben
  • 28,684
  • 5
  • 23
  • 45
  • Although the solution is close enough, I have two conditions to match; 1) df2 row count will be lesser than df1. 2) VALIDFROM should always be lesser than DATEINVOICED, that is if the DATEINVOICED is 06-07-20, then the VALIDFROM should be the closest match 05-07-20. – JBH Jul 06 '20 at 06:21
  • I see...sounds like you want to make sure the `VALIDFORM` is less than - and *not* equal to - `DATEINVOICED`...you can also try `df2[df1, on = .(DATE < DATE), mult = "last"]` assuming the dates are in order... – Ben Jul 06 '20 at 16:38
  • Or potentially you could use `roll = Inf` instead of `nearest` in the answer, but then would need to set the `DATE` in `df2` to one day earlier (`DATE - 1`) so it strictly makes sure `VALIDFROM` is less than `DATEINVOICED` and *not* equal to it...hope this is helpful... – Ben Jul 06 '20 at 16:44