How to split bigrams into column and row pairs for n-columns

Question

Supposing a dataframe like this:

# example dataset
df <- data.frame(
         rowid = 1:3,
         a = c("ax","cz","by"),
         b = c("cy","ax","bz"),
         c = c("bz","ay","cx")
      )

What would an efficient approach be to achieving the following transformation?

#> # A tibble: 3 x 4
#>  rowid      a       b       c
#>  <int>  <chr>   <chr>   <chr>
#>      1      x       z       y
#>      2      x       y       z
#>      3      y       z       x

The goal is to take the second character of each bigram and sort it into columns picked-out by the first character, for each row.

If possible, it would be useful to compare base R and Tidyverse solutions.

score 1 · Answer 1 · answered Feb 04 '22 at 00:09

A Tidyverse solution partially inspired by this recent post using rotate_df() from the sjmisc package: https://stackoverflow.com/a/70682560/8068516

df <- data.frame(
     rowid = 1:3,
     a = c("ax","cz","by"),
     b = c("cy","ax","bz"),
     c = c("bz","ay","cx")
  )

library(sjmisc)
df %>%
  # transpose the dataframe keeping column names
  rotate_df(cn=TRUE) %>%
  # sort columns by first character
  mutate(across(everything(),sort)) %>%
  # transpose back
  rotate_df() %>%
  # remove first character from each string
  mutate(across(everything(),~str_sub(.,2,-1))) %>%
  # make `rowid` column
  rownames_to_column(var="rowid")

The dataframe can optionally be turned into tibble with as_tibble() to exactly match the target output, giving:

#> # A tibble: 3 x 4
#>  rowid       a        b        c
#>  <int>   <chr>    <chr>    <chr>
#>      1       x        z        y
#>      2       x        y        z
#>      3       y        z        x

This solution will generalise to n-columns and is %>% compatible.

thelatemail · Accepted Answer · 2022-02-04T05:24:42.100

Since you were looking for a comparison of base and Tidyverse, I'll chime in with a base solution:

tdf <- t(df[-1])
tdf[] <- substr(tdf, 2, 2)[order(col(tdf), tdf)]
df[-1] <- t(tdf)

#  rowid a b c
#1     1 x z y
#2     2 x y z
#3     3 y z x

To explain the 3 steps:

.1) take a copy of a t()ransposed version of the data
.2a) get the order within each row (col() now since it's transposed) of (implicitly the first letter of) each string
.2b) use this order to select from the second letter of each string and overwrite <- the transposed data.
.3) t()ranspose back to the original structure and overwrite <- the data in df

Benchmark on 30K rows

Base:

bigdf <- df[rep(1:3,10000),]
bigdf$rowid <- 1:30000

system.time({
    tdf <- t(bigdf[-1])
    tdf[] <- substr(tdf,2,2)[order(col(tdf), tdf)]
    bigdf[-1] <- t(tdf)
})
##   user  system elapsed 
##  0.023   0.000   0.023

Tidy:

bigdf <- df[rep(1:3,10000),]
bigdf$rowid <- 1:30000

library(dplyr)
library(tibble)
library(sjmisc)
library(stringr)

system.time({
    bigdf %>%
        rotate_df(cn=TRUE) %>%
        mutate(across(everything(),sort)) %>%
        rotate_df() %>%
        mutate(across(everything(),~str_sub(.,2,-1))) %>%
        rownames_to_column(var="rowid")
})
##   user  system elapsed 
## 21.177   0.047  21.244

Elegant. Are there any strategies for this that don't use the transposition trick? — louisdesu, Feb 04 '22 at 04:02
@louisdesu - there's probably several other ways to achieve this - reshaping to a long file, reordering, and reshaping back could do it too. But this was the quickest, least complicated way I could come up with. — thelatemail, Feb 04 '22 at 04:05

How to split bigrams into column and row pairs for n-columns

2 Answers2