0

I have data that's comprised of several "chunks" of columns. In each chunk, the columns names are the same but distinguished by a prefix: A_color, B_color, C_color, XX_height, YY_height, ZZ_height, etc. Sometimes, there might be just the stem name without any prefix. Also, in each chunk there's only one column with data and the others are NA:

######### CHUNK OF COLOR ########  ## ###### CHUNK OF HEIGHT #############
##   A_color B_color C_color color ## XX_height height YY_height ZZ_height
##   <lgl>   <lgl>     <dbl> <lgl> ## <lgl>     <lgl>  <lgl>         <dbl>
## 1 NA      NA            1 NA    ## NA        NA     NA               15

Therefore, I want to coalesce the chunks to get:

##   color height
##   <dbl>  <dbl>
## 1     1     15

Example

I've built a dumb function to simulate the structure of the data. Its internals don't matter, just the output.

library(stringi)
library(dplyr)
library(purrr)

generate_data <-
  function(n_of_chunks, n_variants_in_each_chunk, nrows) {
    
    # helper func for later
    interweave_vecs <- function(x, y, n){
      c(x, y)[order(c(ceiling(seq_along(x) / n), seq_along(y)))]}
    
    # generate col names
    names_core           <- stri_rand_strings(n = n_of_chunks, length = 10)
    names_expanded       <- rep(names_core, each = n_variants_in_each_chunk)
    prefixes             <- replicate(expr = stri_rand_strings(n = 1, length = sample(1:3, 1)), n = length(names_expanded))
    names_with_prefixes  <- paste(prefixes, names_expanded, sep = "__")
    names_final          <- interweave_vecs(names_with_prefixes, names_core, n_variants_in_each_chunk)
    
    # generate data
    dat <-
      rerun(
      replicate(expr = sample(c(rep(NA, n_variants_in_each_chunk), runif(1))), n = n_of_chunks) %>% 
        as.vector() %>% 
        setNames(., names_final) %>% 
        bind_rows(), 
      .n = nrows
    ) %>%
      bind_rows() %>%
      cbind(., col_blah = rnorm(nrows), col_foo = letters[1:nrows])
    
    # shuffle columns
    dat[,sample(ncol(dat))] %>%
      as_tibble()
    
}

Now simulating the data:

set.seed(2021)
my_df <- generate_data(n_of_chunks = 3, n_variants_in_each_chunk = 3, nrows = 10)
 
> my_df
## # A tibble: 10 x 14
##    v__RmiNdhdGoy s13__1pbZoFVrxX Qx8__1pbZoFVrxX RmiNdhdGoy col_blah Xe__8xOGZuvswo `8xOGZuvswo` A__1pbZoFVrxX w__RmiNdhdGoy col_foo WM__8xOGZuvswo `1pbZoFVrxX` SNr__8xOGZuvswo tYD__RmiNdhdGoy
##            <dbl>           <dbl>           <dbl>      <dbl>    <dbl>          <dbl>        <dbl>         <dbl>         <dbl> <chr>            <dbl>        <dbl>           <dbl>           <dbl>
##  1       NA              NA               NA        NA        -1.50          NA           NA             0.846         0.763 a               NA           NA              0.0479          NA    
##  2        0.0729         NA               NA        NA         2.20           0.365       NA            NA            NA     b               NA            0.995         NA               NA    
##  3       NA              NA               NA         0.398     0.939         NA           NA            NA            NA     c                0.695        0.234         NA               NA    
##  4       NA               0.541           NA        NA         0.713         NA           NA            NA             0.235 d               NA           NA              0.470           NA    
##  5       NA               0.317           NA        NA        -0.805         NA            0.163        NA            NA     e               NA           NA             NA                0.506
##  6       NA              NA                0.991     0.0519    0.791         NA            0.888        NA            NA     f               NA           NA             NA               NA    
##  7        0.159          NA               NA        NA         0.116         NA           NA             0.223        NA     g               NA           NA              0.260           NA    
##  8        0.442           0.0561          NA        NA        -1.33          NA           NA            NA            NA     h                0.663       NA             NA               NA    
##  9       NA              NA                0.346     0.479    -1.82          NA            0.193        NA            NA     i               NA           NA             NA               NA    
## 10       NA              NA                0.443     0.469    -0.403         NA           NA            NA            NA     j               NA           NA              0.450           NA    

In my_df, the columns are organized around 3 "stems":

  • RmiNdhdGoy
  • 1pbZoFVrxX
  • 8xOGZuvswo

Each of which has 3 prefixed versions, and one version which is just the stem, so for example, the RmiNdhdGoy chunk:

my_df %>% 
  select(ends_with("RmiNdhdGoy"))

## # A tibble: 10 x 4
##    v__RmiNdhdGoy RmiNdhdGoy w__RmiNdhdGoy tYD__RmiNdhdGoy
##            <dbl>      <dbl>         <dbl>           <dbl>
##  1       NA         NA              0.763          NA    
##  2        0.0729    NA             NA              NA    
##  3       NA          0.398         NA              NA    
##  4       NA         NA              0.235          NA    
##  5       NA         NA             NA               0.506
##  6       NA          0.0519        NA              NA    
##  7        0.159     NA             NA              NA    
##  8        0.442     NA             NA              NA    
##  9       NA          0.479         NA              NA    
## 10       NA          0.469         NA              NA 
  • The prefix itself is alphanumeric in length between 1-3 characters.
  • The separator between prefix and stem is two underscores (__)

I'm trying to come up with a way that would go over column names, and for the ones that have two underscores, take the substring after that separator, and look for other column names that end with that substring, then consider all those as a chunk of columns and coalesce them.

Desired output

I need to coalesce each chunk to get:

## # A tibble: 10 x 5
##    RmiNdhdGoy `1pbZoFVrxX` `8xOGZuvswo` col_blah col_foo
##         <dbl>        <dbl>        <dbl>    <dbl> <chr>  
##  1     0.763        0.846        0.0479   -1.50  a      
##  2     0.0729       0.995        0.365     2.20  b      
##  3     0.398        0.234        0.695     0.939 c      
##  4     0.235        0.541        0.470     0.713 d      
##  5     0.506        0.317        0.163    -0.805 e      
##  6     0.0519       0.991        0.888     0.791 f      
##  7     0.159        0.223        0.260     0.116 g      
##  8     0.442        0.0561       0.663    -1.33  h      
##  9     0.479        0.346        0.193    -1.82  i      
## 10     0.469        0.443        0.450    -0.403 j   

I've seen a bunch of SO posts about this topic (e.g., this and this and this), but I don't understand how those solutions fit my situation. I hope someone would have an idea. Thanks!


EDIT


Following @tamtam's answer below, I'd like to emphasize that the solution I'm looking for is for any output of generate_data(), and not specifically to my_df. I've showed my_df just to illustrate one exemplar data to tackle.

Emman
  • 3,695
  • 2
  • 20
  • 44

2 Answers2

3

You can try the contains function inside coalesce therefore you don't need to split the column names into before _ and after_. Instead you are just looking for columns that have for example the word "RmiNdhdGoy" inside their name.

In short you tell R to create a new column based on all columns already including a certain expression.

df2 <- my_df %>% 
  mutate(RmiNdhdGoy = coalesce(!!!select(., contains("RmiNdhdGoy"))),
         `1pbZoFVrxX` = coalesce(!!!select(., contains("1pbZoFVrxX"))),
          `8xOGZuvswo` = coalesce(!!!select(., contains("8xOGZuvswo")))) %>%
  select(RmiNdhdGoy, `1pbZoFVrxX`, `8xOGZuvswo`, col_blah, col_foo)

Output

# A tibble: 10 x 5
   RmiNdhdGoy `1pbZoFVrxX` `8xOGZuvswo` col_blah col_foo
        <dbl>        <dbl>        <dbl>    <dbl> <fct>  
 1     0.763        0.846        0.0479   -1.16  a      
 2     0.0729       0.995        0.365    -0.594 b      
 3     0.398        0.234        0.695    -0.583 c      
 4     0.517        0.632        0.432    -0.819 d      
 5     0.0462       0.652        0.213    -1.50  e      
 6     0.947        0.408        0.0811    2.20  f      
 7     0.495        0.0819       0.713     0.939 g      
 8     0.0151       0.459        0.0561    0.713 h      
 9     0.663        0.479        0.346    -0.805 i      
10     0.193        0.469        0.443     0.791 j     
tamtam
  • 3,541
  • 1
  • 7
  • 21
  • thanks. Sorry if I didn't make it clear, but I'm looking for a solution that is not bound to specific names. Every time you run `generate_data()` you get different column names and prefixes. So I need a flexible solution that could handle *any* output of `generate_data()`. – Emman Mar 06 '21 at 12:13
  • What is the `!!!` denoting here? Is there no wrapper function that allows using `contains()` directly within `coalesce()`? I stumbled upon `dplyr::pick()` page but the function doesn't seem to exist. – Karthik Thrikkadeeri Mar 22 '23 at 04:47
1

Just extending the previous answer to remove the hard-coded column names and wrap a solution in a function:

transform_df <- function(start_df_, sep = '__'){
  
  # identify columns of interest
  spread_names <- grep(sep, colnames(start_df_), value = T)
  other_names <- setdiff(colnames(start_df_), spread_names)
  
  # grab the `key` part from their names
  spread_keys <- unique(
    sapply(
      strsplit(spread_names, sep)
      , `[[`
      , 2
    )
  )

  # apply the describe operation
  for(key in spread_keys){
    start_df_ <- start_df_ %>% 
      dplyr::mutate(!!key := coalesce(!!!select(., contains(key))))
  }
  
  # format and return
  start_df_ %>%
    dplyr::select(c(spread_keys, other_names))
}

and then you get the output

set.seed(2022)
my_df <- generate_data(n_of_chunks = 4, n_variants_in_each_chunk = 4, nrows = 10)
my_df %>% transform_df()
# A tibble: 10 x 6
   `098Wb7lVaq` oe7XBd42Mk TdFPA4qYnl qZQ8AvXzTl col_foo col_blah
          <dbl>      <dbl>      <dbl>      <dbl> <chr>      <dbl>
 1       0.0846      0.746      0.381     0.182  a         -0.583
 2       0.509       0.492      0.962     0.702  b         -0.198
 3       0.155       0.912      0.917     0.0141 c         -0.603
 4       0.998       0.603      0.121     0.943  d         -1.09 
 5       0.300       0.853      0.317     0.740  e          0.184
 6       0.0486      0.598      0.330     0.122  f          1.31 
 7       0.0583      0.757      0.650     0.465  g         -0.168
 8       0.819       0.461      0.165     0.597  h          0.344
 9       0.541       0.280      0.978     0.793  i          0.376
10       0.850       0.147      0.865     0.426  j         -0.195   

HTH

Plamen Petrov
  • 317
  • 1
  • 5
  • Thanks, it's great! Small thing though: how can we ensure that `grep(sep, colnames(start_df_), value = T)` matches *exactly* `sep == "__"` and not also `sep == "___" ` nor colnames that include more underscores? – Emman Mar 06 '21 at 15:01
  • I've tried adding a fixer to `sep` before it goes to `grep()`: `sep <- paste0("\\b", sep, "\\b")`, but it didn't work... – Emman Mar 06 '21 at 15:05
  • 1
    hi, `grep` uses regex-es, assuming the column structure you described above `sep = '[a-zA-Z0-9]+_{2}[a-zA-Z0-9]+'`, where `[a-zA-Z0-9]` would match letters / digits, `'+'` ensures that some of those letters / digits appear at least one time and `_{2}` matches exactly 2 underscores. Check `?regex` for more details. HTH – Plamen Petrov Mar 06 '21 at 16:24