Join based on a concatenated 'origin-destination' variable and consecutive individual sites

Question

I have two data frames, d1 and d2:

d1 <- data.frame(ID=c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x2", "x2"),
                   stat=c("A", "B", "C", "D", "E", "H", "I", "J", "J", "K", "L", "M"),
                   OD=c("AE", "AE", "AE", "AE", "AE", "HJ", "HJ", "HJ", "JM", "JM", "JM", "JM"))

d2 <- data.frame(ID=c("x1", "x1", "x1", "x2", "x2", "x2"),
                   OD=c("AE", "AE", "BD", "HJ", "HJ", "JM"),
                   prod=c("p_2", "p_1", "p_3", "p_5", "p_4", "p_5"),
                   wgh=c(1000, 1300, 300, 2300, 1800, 2300))

The data is grouped by 'ID'. The 'stat' column are stations (or sites) visited. The 'OD' column is the origin and destination stations concatenated.

d2 has product id:s 'prod' and their weight 'wgh', which I want to join to d1. The variables to join by are ID and OD.

d1
   ID stat OD
1  x1    A AE
2  x1    B AE 
3  x1    C AE
4  x1    D AE
5  x1    E AE
6  x2    H HJ
7  x2    I HJ
8  x2    J HJ
9  x2    J JM
10 x2    K JM
11 x2    L JM
12 x2    M JM

d2
  ID OD prod  wgh
1 x1 AE  p_2 1000
2 x1 AE  p_1 1300
3 x1 BD  p_3  300 # origin B & and dest. D within the range of 'stat' in d1, but 'OD' doesn't match   
4 x2 HJ  p_5 2300
5 x2 HJ  p_4 1800
6 x2 JM  p_5 2300

Sometimes OD matches between the two data sets, e.g. ID = "x1" & OD = "AE". Fine.

However, the main issue is that there are cases in d2 when the origin occurs after the origin in d1, and when the destination occurs before the destination in d1. For example, in d2, where ID = x1, the OD = BD is not present in d1 (comment in data). The stations in d2, B to D, have been visited on the way between A and E. This means that 'prod' and 'wgh' from d2 should be joined to rows in d1 where 'stat' is from B to D.

In the result, the number of rows of d1, should remain the same. However, the number of columns would be increased, as I was thinking to put up the available prod product ids as variables, and then match the wgh weights to those.

The result table would look something like this:

Maybe merge and/or dcast is a good way to start. I was even thinking of ignoring OD and just make an origin and destination column, then somehow fill down values in between when they match with stat, but I got stuck.

Note that my real data are millions of row long.

Reshape dframe_2 with `pivot_wider()` and `left_join(dframe_1, dframe_2` would do most of what you described I think — Chamkrai, Aug 02 '22 at 13:03

score 1 · Answer 1 · answered Aug 02 '22 at 13:04

A suggestion for the reshaping

library(tidyverse)

dframe_2 %>% 
  pivot_wider(names_from = prod,
              values_from = wgh,
              values_fill = 0)

left_join(dframe_1, dframe_2)

  ID    stat  OD     p_18  p_15   p_3  p_16  p_14
   <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 X1    A     AE     1000  1300     0     0     0
 2 X1    B     AE     1000  1300     0     0     0
 3 X1    C     AE     1000  1300     0     0     0
 4 X1    D     AE     1000  1300     0     0     0
 5 X1    E     AE     1000  1300     0     0     0
 6 X2    F     HJ        0     0     0  2300  1800
 7 X2    G     HJ        0     0     0  2300  1800
 8 X2    H     HJ        0     0     0  2300  1800
 9 X2    I     JM        0     0     0  2300     0
10 X2    J     JM        0     0     0  2300     0
11 X2    K     JM        0     0     0  2300     0
12 X2    L     JM        0     0     0  2300     0

hey, works perfectly, thanks! Still thinking about the other issue — Rudolf Nyitray, Aug 02 '22 at 13:14

Ottie · Answer 2 · 2022-08-04T10:04:29.997

From AE to A, B, C, D, E

You want to expand your AE to multiple entries, alphabetically - you can use the LETTERS built-in for a reference alphabet, and use match to find the location of the initial and final letter. For example

OD <- "AE"
ODarray <- LETTERS[match(substr(OD,1,1), LETTERS) : match(substr(OD,2,2), LETTERS)]
print(ODarray)

[1] "A" "B" "C" "D" "E"

As per your comment, "AE" is just an example, and your codes are not alphabetically ordered letters. As long as the codes are characters with a linear order, you can follow the exact same procedure with a custom array instead of LETTERS, for example:

STATIONS <- c("Cockfosters", "Oakwood", "Southgate")

Then, in order to link your d1 and d2, you can loop this process to create a bridge df between OD and stat:

bridge = do.call(rbind, lapply(unique(d2$OD), function(od) 
    data.frame("OD" = od, 
               "stat" = LETTERS[match(substr(od,1,1), LETTERS) : 
                                match(substr(od,2,2), LETTERS)]
)))
print(bridge)

   OD stat
1  AE    A
2  AE    B
3  AE    C
4  AE    D
5  AE    E
6  BD    B
7  BD    C
8  BD    D
9  HJ    H
10 HJ    I
11 HJ    J
12 JM    J
13 JM    K
14 JM    L
15 JM    M

From BD to AE

The next problem is that you want to match the OD in d1 to a different OD in d2, for example the values in d2 that have an OD = "BD" should actually be matched to the OD = "AE" in d1.

Manual method

The easiest method is to create a named array a lookup, where each value in d1$OD maps to itself, and values that are in d2$OD but not d1$OD have an explicitly defined translation:

ODlookup <- c(
   setNames(unique(d1$OD), unique(d1$OD)), # identity map
   "BD" = "AE")                            # manual translations 
d2$newOD <- ODlookup[d2$OD]

Regex method

If every OD in d2 is a subset of one of the ODs in d1, then you can use regular expressions to look for one pattern into the other. The catch is that you'll need to run a similar expansion to the one you've done above, because "BD" is not a substring of "AE", but "BCD" (the expansion of "BD") is a substring of "ABCDE" (the expansion of "AE").

The function to generate this expansion is quite similar to the previous one:

expand <- function(od) {
    paste(LETTERS[match(substr(od,1,1), LETTERS) : 
                  match(substr(od,2,2), LETTERS)], 
    collapse="") }

expanded_d1 <- sapply(unique(d1$OD), expand)
expanded_d2 <- sapply(unique(d2$OD), expand)

print(expanded_d1)

     AE      HJ      JM 
"ABCDE"   "HIJ"  "JKLM"

Again you can play around to match the format of your actual data, for example if your codes are something like "ST0123-ST4567" then you would tweak the substr parameters to capture the correct substring.

You can now use grepl to identify which sequences of OD codes in d1 contain each of the possible OD codes in d2. grepl returns an array of booleans, with TRUE for successful matches, so you can use match(TRUE, grepl(...)) to extract the index of the first successful match.
Disclaimer: if your ranges overlap, there may be multiple possible matches (e.g. d1 contains the ranges "HJ" and "JM": an OD value of "J" could fit in either of those). match will just pick whichever one comes first.

ODmap <- sapply(expanded_d2, function(od) 
    names(expanded_d1)[match(TRUE, grepl(od, expanded_d1, fixed=TRUE))])

print(ODmap)

 AE   BD   HJ   JM 
"AE" "AE" "HJ" "JM"

As you can see, the "BD" value has been mapped to "AE" (and "AE" has also been mapped to "AE", because of course "ABCDE" is also contained in "ABCDE"). You now use this to create a new OD column in d2 where only values that exist in d1 are allowed:

d2$newOD <- ODmap[d2$OD]

print(d2)

  ID OD prod  wgh newOD
1 x1 AE  p_2 1000    AE
2 x1 AE  p_1 1300    AE
3 x1 BD  p_3  300    AE
4 x2 HJ  p_5 2300    HJ
5 x2 HJ  p_4 1800    HJ
6 x2 JM  p_5 2300    JM

From d1 and d2 to a single df

Merge the bridge into d2, and the resulting data.frame into d1. You'll have to specify you want to use the newly-created d2$newOD. Done!

merge(d1,merge(d2, bridge), 
    by.x = c("ID", "OD", "stat"), 
    by.y = c("ID", "newOD", "stat"))

   ID OD stat OD.y prod  wgh
1  x1 AE    A   AE  p_2 1000
2  x1 AE    A   AE  p_1 1300
3  x1 AE    B   AE  p_1 1300
4  x1 AE    B   AE  p_2 1000
5  x1 AE    B   BD  p_3  300
6  x1 AE    C   AE  p_1 1300
7  x1 AE    C   BD  p_3  300
8  x1 AE    C   AE  p_2 1000
9  x1 AE    D   AE  p_1 1300
10 x1 AE    D   BD  p_3  300
11 x1 AE    D   AE  p_2 1000
12 x1 AE    E   AE  p_1 1300
13 x1 AE    E   AE  p_2 1000
14 x2 HJ    H   HJ  p_5 2300
15 x2 HJ    H   HJ  p_4 1800
16 x2 HJ    I   HJ  p_5 2300
17 x2 HJ    I   HJ  p_4 1800
18 x2 HJ    J   HJ  p_5 2300
19 x2 HJ    J   HJ  p_4 1800
20 x2 JM    J   JM  p_5 2300
21 x2 JM    K   JM  p_5 2300
22 x2 JM    L   JM  p_5 2300
23 x2 JM    M   JM  p_5 2300

You can then follow Tom Hoel's answer for the reshaping (although consider whether this long format may actually be more useful, instead of creating values of 0 where you are missing data)

but it does not solve the problem of OD pair "BD", the weight of 300, does it? — Rudolf Nyitray, Aug 02 '22 at 14:54
thanks a lot. Unfortunatelly, I have plenty of combinations, doing in manually is not possible. — Rudolf Nyitray, Aug 03 '22 at 07:29
also, what if my OD pairs are not alphabetical letters, but instead real station names? Thus 'LETTERS' function is not possible. thanks! — Rudolf Nyitray, Aug 03 '22 at 07:52
We can’t solve problems we don’t know about. LETTERS is just a built-in array; you could define your own STATIONS array like `c("Cockfosters", "Oakwood", "Southgate")`. For the sub-pattern matching, I recommend you start a separate question. — Ottie, Aug 03 '22 at 08:36
could you help me with the regex method instead of the manual translations? So I basically have a large number of possible values. Thanks! — Rudolf Nyitray, Aug 03 '22 at 13:42
thank you @Ottie, I just noticed your edited answer. Really appreciate, I am still trying with subsets of the whole data, because it would take a long time for the whole to run it. — Rudolf Nyitray, Aug 08 '22 at 13:32