2

Supposing a dataframe like this:

# example dataset
df <- data.frame(
         rowid = 1:3,
         a = c("ax","cz","by"),
         b = c("cy","ax","bz"),
         c = c("bz","ay","cx")
      )

What would an efficient approach be to achieving the following transformation?

#> # A tibble: 3 x 4
#>  rowid      a       b       c
#>  <int>  <chr>   <chr>   <chr>
#>      1      x       z       y
#>      2      x       y       z
#>      3      y       z       x

The goal is to take the second character of each bigram and sort it into columns picked-out by the first character, for each row.

If possible, it would be useful to compare base R and Tidyverse solutions.

louisdesu
  • 55
  • 6

2 Answers2

1

A Tidyverse solution partially inspired by this recent post using rotate_df() from the sjmisc package: https://stackoverflow.com/a/70682560/8068516

df <- data.frame(
     rowid = 1:3,
     a = c("ax","cz","by"),
     b = c("cy","ax","bz"),
     c = c("bz","ay","cx")
  )

library(sjmisc)
df %>%
  # transpose the dataframe keeping column names
  rotate_df(cn=TRUE) %>%
  # sort columns by first character
  mutate(across(everything(),sort)) %>%
  # transpose back
  rotate_df() %>%
  # remove first character from each string
  mutate(across(everything(),~str_sub(.,2,-1))) %>%
  # make `rowid` column
  rownames_to_column(var="rowid")

The dataframe can optionally be turned into tibble with as_tibble() to exactly match the target output, giving:

#> # A tibble: 3 x 4
#>  rowid       a        b        c
#>  <int>   <chr>    <chr>    <chr>
#>      1       x        z        y
#>      2       x        y        z
#>      3       y        z        x

This solution will generalise to n-columns and is %>% compatible.

louisdesu
  • 55
  • 6
1

Since you were looking for a comparison of base and Tidyverse, I'll chime in with a base solution:

tdf <- t(df[-1])
tdf[] <- substr(tdf, 2, 2)[order(col(tdf), tdf)]
df[-1] <- t(tdf)

#  rowid a b c
#1     1 x z y
#2     2 x y z
#3     3 y z x

To explain the 3 steps:

.1) take a copy of a t()ransposed version of the data
.2a) get the order within each row (col() now since it's transposed) of (implicitly the first letter of) each string
.2b) use this order to select from the second letter of each string and overwrite <- the transposed data.
.3) t()ranspose back to the original structure and overwrite <- the data in df


Benchmark on 30K rows

Base:

bigdf <- df[rep(1:3,10000),]
bigdf$rowid <- 1:30000

system.time({
    tdf <- t(bigdf[-1])
    tdf[] <- substr(tdf,2,2)[order(col(tdf), tdf)]
    bigdf[-1] <- t(tdf)
})
##   user  system elapsed 
##  0.023   0.000   0.023 

Tidy:

bigdf <- df[rep(1:3,10000),]
bigdf$rowid <- 1:30000

library(dplyr)
library(tibble)
library(sjmisc)
library(stringr)

system.time({
    bigdf %>%
        rotate_df(cn=TRUE) %>%
        mutate(across(everything(),sort)) %>%
        rotate_df() %>%
        mutate(across(everything(),~str_sub(.,2,-1))) %>%
        rownames_to_column(var="rowid")
})
##   user  system elapsed 
## 21.177   0.047  21.244 
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Elegant. Are there any strategies for this that don't use the transposition trick? – louisdesu Feb 04 '22 at 04:02
  • 1
    @louisdesu - there's probably several other ways to achieve this - reshaping to a long file, reordering, and reshaping back could do it too. But this was the quickest, least complicated way I could come up with. – thelatemail Feb 04 '22 at 04:05