1

I am trying to create unique IDs for two columns. The dataframe has two colums, Sp1 and Sp2, which are character strings. Sp1 and Sp2 can be the same character string. I want to have a unique (Sp1;Sp2) couple identifier that does not take the order of (Sp1;Sp2) into account. For instance, (A;B) should have the same identifier as (B;A).

I tried to use dplyr but it did not work since the sort() function renders the same identifier for each row.

So I used a classical loop, but I have a large dataset and it takes too much time :

for (k in 1:nrow(data)){
 data$Couple[k] <- paste0(
  sort(c(as.character(data$Sp_1[k]), as.character(data$Sp_2[k])))[1],
  "_",
  sort(c(as.character(data$Sp_1[k]), as.character(data$Sp_2[k])))[2])
}

I would also like to have numbers rather than character strings since it would take less memory.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Camchou
  • 95
  • 9
  • It would be easier to help if you create a small reproducible example along with expected output. Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah May 11 '21 at 11:05
  • Yes, sorry I did not give an example. @BluVoxe did give one in their answer (thanks !!), I will edit my post with it. – Camchou May 11 '21 at 11:16
  • 1
    Related post: https://stackoverflow.com/q/15487151/680068 – zx8754 May 11 '21 at 12:33
  • It is better to keep benchmarking as a Wiki type answer. I edited your question and moved benchmarking to an answer, see below. – zx8754 May 12 '21 at 07:39
  • Thank you @zx8754. I am sorry I did not do it right ! – Camchou May 18 '21 at 17:57

3 Answers3

3

Using pmin and pmax:

data %>% 
  mutate(id1 = paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2)),
         id2 = as.integer(as.factor(id1)))

# # A tibble: 10 x 4
#    Sp_1  Sp_2  id1     id2
#    <chr> <chr> <chr> <int>
#  1 a     e     ae       3
#  2 d     e     de       7
#  3 a     b     ab       1
#  4 b     b     bb       4
#  5 e     a     ae       3
#  6 c     e     ce       6
#  7 b     e     be       5
#  8 c     a     ac       2
#  9 c     a     ac       2
# 10 a     e     ae       3

Edit: If we are after efficiency stick with base::transform, see benchmark. It is 5x faster with your example data and 1.5x faster with a bigger data:

# bigger data
set.seed(1); data <- tibble(
  Sp_1 = sample(letters[1:5], 10000, replace = TRUE),
  Sp_2 = sample(letters[1:5], 10000, replace = TRUE)
)

microbenchmark::microbenchmark(
  x1 = {
    data %>% 
      mutate(id = as.integer(as.factor(
        paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2)))))
  },
  x2 = {
    transform(data,
              id = as.integer(as.factor(
                paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2)))))
    
  }, unit = "relative")

# Unit: relative
# expr      min       lq     mean   median       uq      max neval
#   x1 1.476691 1.457313 1.414833 1.429563 1.303684 2.209446   100
#   x2 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
zx8754
  • 52,746
  • 12
  • 114
  • 209
1

Here is a solution which first uses rowwise() to get a representation of Sp_1 + Sp_2 where the order doesn't matter, and then purrr::map_int() to transform this into a unique id:

# Make this reproducible
set.seed(1)

# Load packages
library(dplyr)
library(purrr)

# Define and inspect a test dataset
data <- tibble(
  Sp_1 = sample(letters[1:5], 10, replace = TRUE),
  Sp_2 = sample(letters[1:5], 10, replace = TRUE)
)

data
#> # A tibble: 10 x 2
#>    Sp_1  Sp_2 
#>    <chr> <chr>
#>  1 a     e    
#>  2 d     e    
#>  3 a     b    
#>  4 b     b    
#>  5 e     a    
#>  6 c     e    
#>  7 b     e    
#>  8 c     a    
#>  9 c     a    
#> 10 a     e

data %>% 
  
  # Add a unique representation of `Sp_1` and `Sp_2` where order doesn't matter
  rowwise() %>% 
  mutate(string = paste(sort(c(Sp_1, Sp_2)), collapse = "")) %>% 
  ungroup() %>% 
  
  # Use `map_int()` to get an integer `id` representation of `string`
  mutate(id = map_int(string, ~which(unique(string) == .)))
#> # A tibble: 10 x 4
#>    Sp_1  Sp_2  string    id
#>    <chr> <chr> <chr>  <int>
#>  1 a     e     ae         1
#>  2 d     e     de         2
#>  3 a     b     ab         3
#>  4 b     b     bb         4
#>  5 e     a     ae         1
#>  6 c     e     ce         5
#>  7 b     e     be         6
#>  8 c     a     ac         7
#>  9 c     a     ac         7
#> 10 a     e     ae         1
wurli
  • 2,314
  • 10
  • 17
  • 1
    Thank you this is just perfect. I had never used purrr before but I think it's a wonderful tool ! Cheers. – Camchou May 11 '21 at 11:14
0

Benchmarking

The reproducible example was given by @BluVoxe. We compare the output and the running times of @BluVoxe's and both @zx8754's solutions :

library(dplyr)
library(purrr)
library(microbenchmark)

# Make this reproducible
set.seed(1)

# Define and inspect a test dataset
data <- tibble(
  Sp_1 = sample(letters[1:5], 10, replace = TRUE),
  Sp_2 = sample(letters[1:5], 10, replace = TRUE)
)

data

# # A tibble: 10 x 2
#    Sp_1  Sp_2 
#    <chr> <chr>
#  1 a     e    
#  2 d     e    
#  3 a     b    
#  4 b     b    
#  5 e     a    
#  6 c     e    
#  7 b     e    
#  8 c     a    
#  9 c     a    
# 10 a     e

#First solution

data1 <- data%>% 
  
  # Add a unique representation of `Sp_1` and `Sp_2` where order doesn't matter
  dplyr::rowwise() %>% 
  dplyr::mutate(string = paste(sort(c(Sp_1, Sp_2)), collapse = "")) %>% 
  dplyr::ungroup() %>% 
  
  # Use `map_int()` to get an integer `id` representation of `string`
  dplyr::mutate(id = purrr::map_int(string, ~which(unique(string) == .)))

data1

# # A tibble: 10 x 4
#   Sp_1  Sp_2  string id
#   <chr> <chr> <chr>  <int>
# 1 a     e     ae     1
# 2 d     e     de     2
# 3 a     b     ab     3
# 4 b     b     bb     4
# 5 e     a     ae     1
# 6 c     e     ce     5
# 7 b     e     be     6
# 8 c     a     ac     7
# 9 c     a     ac     7
# 10 a    e     ae     1


# Second solution

data2 <- data %>% 
  dplyr::mutate(id1 = paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2)),
         id2 = as.integer(as.factor(id1)))

data2

# # A tibble: 10 x 4
#   Sp_1  Sp_2  id1   id2
#   <chr> <chr> <chr> <int>
# 1 a     e     ae    3
# 2 d     e     de    7
# 3 a     b     ab    1
# 4 b     b     bb    4
# 5 e     a     ae    3
# 6 c     e     ce    6
# 7 b     e     be    5
# 8 c     a     ac    2
# 9 c     a     ac    2
# 10 a    e     ae    3

# Third solution
data3 <- transform(
  data,
  id = as.integer(
    as.factor(
      paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2))
      )
    )
  )

data3

#       Sp_1 Sp_2 id
# 1     a    e    3
# 2     d    e    7
# 3     a    b    1
# 4     b    b    4
# 5     e    a    3
# 6     c    e    6
# 7     b    e    5
# 8     c    a    2
# 9     c    a    2
# 10    a    e    3

 #Compare efficiency :

microbenchmark::microbenchmark(
  x1 = {
    data%>% 
      dplyr::rowwise() %>% 
      dplyr::mutate(string = paste(sort(c(Sp_1, Sp_2)), collapse = "")) %>% 
      dplyr::ungroup() %>%
      dplyr::mutate(id = purrr::map_int(string, ~which(unique(string) == .)))
  },
  x2 = {
    data %>% 
      dplyr::mutate(id = as.integer(as.factor(
        paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2)))))
  },
  x3 = {
    transform(data,
              id = as.integer(as.factor(
                paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2)))))
    
  },
  unit = "relative")

#Unit: relative
 #expr       min        lq      mean    median        uq      max neval
  # x1 23.329340 24.151001 23.951911 23.710270 22.996736 28.23673   100
   #x2  8.064332  7.785381  8.214726  7.796895  7.741803 19.18936   100
   #x3  1.000000  1.000000  1.000000  1.000000  1.000000  1.00000   100

#With bigger data :

set.seed(1)
data <- tibble(
  Sp_1 = sample(letters[1:5], 10000, replace = TRUE),
  Sp_2 = sample(letters[1:5], 10000, replace = TRUE)
)

microbenchmark::microbenchmark(
  x1 = {
    data%>% 
      dplyr::rowwise() %>% 
      dplyr::mutate(string = paste(sort(c(Sp_1, Sp_2)), collapse = "")) %>% 
      dplyr::ungroup() %>% 
      dplyr::mutate(id = purrr::map_int(string, ~which(unique(string) == .)))
  },
  x2 = {
    data %>% 
      dplyr::mutate(id = as.integer(as.factor(
        paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2)))))
  },
  x3 = {
    transform(data,
              id = as.integer(as.factor(
                paste0(pmin(Sp_1, Sp_2), pmax(Sp_1, Sp_2)))))
    
  }, unit = "relative")

#Unit: relative
 #expr        min         lq       mean     median         uq        max neval
   #x1 524.626924 512.590748 506.051098 515.687843 521.642359 418.635195   100
   #x2   1.503782   1.514021   1.577941   1.559449   1.620967   1.648478   100
   #x3   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000   100

The last solution is the most efficient.

However there are differences : the first and the lat solutions render ID in the same order as they appear in the dataset, while the second solution renders ID in the alphabetical order of the couple.

zx8754
  • 52,746
  • 12
  • 114
  • 209