Ways to full_join with alternating "non-joined duplicate variables" instead of .x,.x,.x,.y,.y,.y format, and corresponding calculation

Question

library(dplyr)
x1 <- tibble(ID = rep(1,4),
             Measure = c("Height","Weight","Width","Length"),
             AD1= rpois(4,10),
             AD2= rpois(4,9))
x2 <- tibble(ID = rep(2,4),
             Measure = c("Height","Weight","Width","Length"),
             AD1= rpois(4,20),
             AD2= rpois(4,19))

xdiff <- x1 %>%
  full_join(x2,by="Measure",
            suffix=c("_1","_2")) %>%
  mutate(diff1 = AD1_1-AD1_2,
         diff2 = AD2_1-AD2_2)

   ID_1 Measure AD1_1 AD2_1  ID_2 AD1_2 AD2_2 diff1 diff2
  <dbl> <chr>   <int> <int> <dbl> <int> <int> <int> <int>
1     1 Height     10    12     2    20    15   -10    -3
2     1 Weight      7    10     2    11    16    -4    -6
3     1 Width      12    11     2    22    17   -10    -6
4     1 Length     11     9     2    24    16   -13    -7

xdesire <- xdiff %>%
  select(Measure,AD1_1,AD1_2,diff1,AD2_1,AD2_2,diff2)

  Measure AD1_1 AD1_2 diff1 AD2_1 AD2_2 diff2
  <chr>   <int> <int> <int> <int> <int> <int>
1 Height     10    20   -10    12    15    -3
2 Weight      7    11    -4    10    16    -6
3 Width      12    22   -10    11    17    -6
4 Length     11    24   -13     9    16    -7

full_join(), with default arguments, will add suffixes to the duplicate variables in sequential order of x,x,x,y,y,y. I could technically use select() to order the columns as desired, but if the data have large amount of columns, e.g., AD1:AD200, it would be very inefficient.

dply: order columns alphabetically in R shows method of order the columns alphabetically, but in my case, there is "difference" column that is needed between every variables (AD1,diff1,AD2,diff).

I was wondering if full_join() allows duplicate variables to be alternating, and if I could directly add the calculations with them through join() function.

Of course, other methods, such as more advanced select() argument is welcomed too.

Maybe `starts_with()`? https://tidyselect.r-lib.org/reference/select_helpers.html — TTS, Jan 14 '21 at 21:19
@akrun I was planning to make a new column with string entries like "1 vs 2", but reasoned it would just complicate the motivating example I created for the purpose of my question — aiorr, Jan 14 '21 at 21:54
@aiorr Can you please add an example that is more representative so that it can be tested. My solution is based on the example you posted — akrun, Jan 14 '21 at 22:01
@akrun I will if your answer cease to work in different situations (it's working for the purpose of my current work), as a different post. I accepted your answer for this post — aiorr, Jan 15 '21 at 14:48

score 1 · Accepted Answer · answered Jan 14 '21 at 21:22

1

An option is to order on the first numeric substring extracted with parse_number on the column names

library(dplyr)
xdiff %>%
  select(Measure, order(readr::parse_number(names(.))),  -starts_with("ID"))

-output

# A tibble: 4 x 7
#  Measure AD1_1 AD1_2 diff1 AD2_1 AD2_2 diff2
#  <chr>   <int> <int> <int> <int> <int> <int>
#1 Height     10    25   -15     5    20   -15
#2 Weight     16    28   -12    13    15    -2
#3 Width      12    19    -7     9    25   -16
#4 Length     10    29   -19     4    23   -19

answered Jan 14 '21 at 21:22

akrun

874,273
37
540
662

question: above codes will result in `1 parsing failures.` since the variable `Measure` does not have number in it. While it doesn't cause any meaningful problem (since we explicitly state `Measure` as our first select() argument), what can we do to prevent it? I am guessing we have to do something about `names(.)`? – aiorr Jan 22 '21 at 18:14
1

@aiorr You may use `select(Measure, order(readr::parse_number(names(.)[-1])) + 1, -starts_with("ID"))`. I guess it is a warning and not an error right? – akrun Jan 22 '21 at 20:24
yes, just a warning, not an error. Is it possible to do something along the line of `select(Measure, order(readr::parse_number(names(.)[-Measure])), -starts_with("ID"))` ? (above line doesn't work, of course) – aiorr Jan 22 '21 at 21:36
1

@aiorr You can try `xdiff %>% select(Measure, order(readr::parse_number(setdiff(names(.), 'Measure'))) + 1, -starts_with("ID"))` – akrun Jan 23 '21 at 17:55

score 1 · Answer 2 · answered Jan 17 '22 at 18:37

You can use {powerjoin} to nest the conflicting columns together with the created diff column, then unpack all of them:

library(powerjoin)
library(tidyverse)
x1 <- tibble(Measure = c("Height","Weight","Width","Length"),
             AD1= rpois(4,10),
             AD2= rpois(4,9))
x2 <- tibble(Measure = c("Height","Weight","Width","Length"),
             AD1= rpois(4,20),
             AD2= rpois(4,19))

power_left_join(
  x1, x2, by = "Measure", conflict = ~ tibble("1" = .x, "2" = .y, "diff" = .x - .y)) %>% 
  unpack(where(is.data.frame), names_sep = "_")
#> # A tibble: 4 × 7
#>   Measure AD1_1 AD1_2 AD1_diff AD2_1 AD2_2 AD2_diff
#>   <chr>   <int> <int>    <int> <int> <int>    <int>
#> 1 Height     16    27      -11     6    17      -11
#> 2 Weight      7    14       -7    10    21      -11
#> 3 Width      14    21       -7    16    22       -6
#> 4 Length     11    23      -12     6    21      -15

Ways to full_join with alternating "non-joined duplicate variables" instead of .x,.x,.x,.y,.y,.y format, and corresponding calculation

2 Answers2