R How to compute a unique index for two character variables with varying columns?

Question

I'm not sure if I phrased my question properly, so let me give a simplified example:

Given a dataset as follows:

dat <- tibble(X = c("A", "B", "B", "C", "A"), 
              Y = c("B", "A", "C", "A", "C"))

How can I compute a pair variable that represents whatever is in X and Y in a given row, without generating duplicates, as here:

dat$pair <- c("A-B", "A-B", "B-C", "C-A", "C-A")
dat
# A tibble: 5 × 3
  X     Y     pair 
  <chr> <chr> <chr>
1 A     B     A-B  
2 B     A     A-B  
3 B     C     B-C  
4 C     A     C-A  
5 A     C     C-A

I can compute a pairing with paste0, but it will introduce duplicates (C-A is the same as A-C for me) that I want to avoid:

> dat <- mutate(dat, pair = paste0(X, "-", Y))
> dat
# A tibble: 5 × 3
  X     Y     pair 
  <chr> <chr> <chr>
1 A     B     A-B  
2 B     A     B-A  
3 B     C     B-C  
4 C     A     C-A  
5 A     C     A-C

Does your data only include upper case letters and one letter in each element? — Peter, Aug 27 '21 at 18:22
@blazej Does the order matter? For example, would `A-C` and `A-C` be acceptable instead of `C-A` and `C-A`? — Ben, Aug 27 '21 at 18:26
@Peter, no, it's actually longer strings with multiple characters — blazej, Aug 28 '21 at 08:13

score 3 · Accepted Answer · answered Aug 28 '21 at 11:23

We can use pmin and pmax to take the smaller and larger value of each row (element-wise) and paste them together.

transform(dat, pair = paste(pmin(X, Y), pmax(X, Y), sep = '-'))

#  X Y pair
#1 A B  A-B
#2 B A  A-B
#3 B C  B-C
#4 C A  A-C
#5 A C  A-C

If you prefer dplyr, this can be written as:

library(dplyr)

dat %>% mutate(pair = paste(pmin(X, Y), pmax(X, Y), sep = '-'))
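
As a minimal sketch of why this works: pmin and pmax compare their arguments element by element, so for two columns they return the smaller and the larger value of each row. The vectors x and y below are made-up example data with longer strings (they are not from the question), and string ordering follows the current locale's collation.

```r
# Hypothetical example data with multi-character strings
x <- c("apple", "pear", "kiwi")
y <- c("pear", "apple", "fig")

lo <- pmin(x, y)  # element-wise smaller string of each (x[i], y[i]) pair
hi <- pmax(x, y)  # element-wise larger string of each pair

# Pasting smaller-then-larger gives the same label regardless of input order
pair <- paste(lo, hi, sep = "-")
pair
```

If the columns are factors rather than character vectors, converting them with as.character() first is probably the safer route. Note also that, by default, a missing value in either column propagates through both pmin and pmax.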

answered Aug 28 '21 at 11:23

Ronak Shah

All solutions presented here are nice, but this is the real deal :) I've been meaning to ask: when we apply `pmax` or `pmin` to a data set, is it applied to every row? – Anoushiravan R Aug 28 '21 at 11:35

score 2 · Answer 2 · answered Aug 27 '21 at 20:07

With dplyr and tidyr you could try:

library(dplyr)
library(tidyr)

dat %>% 
  rowwise() %>% 
  mutate(pair = list(c(X, Y)),
         pair = list(sort(pair)),
         pair = list(paste(pair, collapse = "-"))) %>% 
  select(pair) %>% 
  distinct() %>% 
  unnest(pair)
#> # A tibble: 3 x 1
#>   pair 
#>   <chr>
#> 1 A-B  
#> 2 B-C  
#> 3 A-C

Created on 2021-08-27 by the reprex package (v2.0.0)

data

dat <- data.frame(X = c("A", "B", "B", "C", "A"), 
                  Y = c("B", "A", "C", "A", "C"))

edited Aug 28 '21 at 09:12

answered Aug 27 '21 at 20:07

Peter

Thanks @Peter, I chose @Samet's response as it returns all the columns and not just the pairing. BTW, there is a comma missing after `pair = list(sort(pair))` in your code :) – blazej Aug 28 '21 at 09:07
Thanks for the feedback. I have added the comma; my omission. If you want all pairs, just remove the `distinct()` step. My reading of your question was that you wanted to "avoid duplicate pairs". – Peter Aug 28 '21 at 09:15

score 2 · Answer 3 · answered Aug 27 '21 at 20:08

I sorted the two values within each row, then pasted them together:

dat <- data.frame(X = c("A", "B", "B", "C", "A"), 
                  Y = c("B", "A", "C", "A", "C"))

library(dplyr)


dat %>%
  rowwise() %>%
  # sort the two values of each row, then join them with "-"
  mutate(pair = paste0(sort(c(as.character(X), as.character(Y))), collapse = "-")) %>%
  ungroup()

Output:

  X     Y     pair 
  <fct> <fct> <chr>
1 A     B     A-B  
2 B     A     A-B  
3 B     C     B-C  
4 C     A     A-C  
5 A     C     A-C
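
The same sort-then-paste idea can also be sketched in base R with apply, which would extend to more than two columns. This is an illustrative variant rather than part of the answer above; the column selection c("X", "Y") and the name pair simply follow the question's example.

```r
dat <- data.frame(X = c("A", "B", "B", "C", "A"),
                  Y = c("B", "A", "C", "A", "C"),
                  stringsAsFactors = FALSE)

# For each row, sort the values and join them with "-";
# unname() drops any row names apply() might carry over
dat$pair <- unname(apply(dat[c("X", "Y")], 1,
                         function(row) paste(sort(row), collapse = "-")))
dat$pair
```

Because apply loops over rows, this is likely slower than a vectorised approach on large data, but it avoids any package dependency.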
