I have two data frames, d1 and d2:
d1 <- data.frame(ID=c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x2", "x2"),
stat=c("A", "B", "C", "D", "E", "H", "I", "J", "J", "K", "L", "M"),
OD=c("AE", "AE", "AE", "AE", "AE", "HJ", "HJ", "HJ", "JM", "JM", "JM", "JM"))
d2 <- data.frame(ID=c("x1", "x1", "x1", "x2", "x2", "x2"),
OD=c("AE", "AE", "BD", "HJ", "HJ", "JM"),
prod=c("p_2", "p_1", "p_3", "p_5", "p_4", "p_5"),
wgh=c(1000, 1300, 300, 2300, 1800, 2300))
The data is grouped by 'ID'. The 'stat' column are stations (or sites) visited. The 'OD' column is the origin and destination stations concatenated.
d2 has product id:s 'prod' and their weight 'wgh', which I want to join to d1. The variables to join by are ID and OD.
d1
ID stat OD
1 x1 A AE
2 x1 B AE
3 x1 C AE
4 x1 D AE
5 x1 E AE
6 x2 H HJ
7 x2 I HJ
8 x2 J HJ
9 x2 J JM
10 x2 K JM
11 x2 L JM
12 x2 M JM
d2
ID OD prod wgh
1 x1 AE p_2 1000
2 x1 AE p_1 1300
3 x1 BD p_3 300 # origin B & and dest. D within the range of 'stat' in d1, but 'OD' doesn't match
4 x2 HJ p_5 2300
5 x2 HJ p_4 1800
6 x2 JM p_5 2300
Sometimes OD matches between the two data sets, e.g. ID = "x1" & OD = "AE". Fine.
However, the main issue is that there are cases in d2 when the origin occurs after the origin in d1, and when the destination occurs before the destination in d1. For example, in d2, where ID = x1, the OD = BD is not present in d1 (comment in data). The stations in d2, B to D, have been visited on the way between A and E. This means that 'prod' and 'wgh' from d2 should be joined to rows in d1 where 'stat' is from B to D.
In the result, the number of rows of d1, should remain the same. However, the number of columns would be increased, as I was thinking to put up the available prod product ids as variables, and then match the wgh weights to those.
The result table would look something like this:
Maybe merge and/or dcast is a good way to start. I was even thinking of ignoring OD and just make an origin and destination column, then somehow fill down values in between when they match with stat, but I got stuck.
Note that my real data are millions of row long.