I have data that's comprised of several "chunks" of columns. In each chunk, the columns names are the same but distinguished by a prefix: A_color
, B_color
, C_color
, XX_height
, YY_height
, ZZ_height
, etc. Sometimes, there might be just the stem name without any prefix. Also, in each chunk there's only one column with data and the others are NA
:
######### CHUNK OF COLOR ######## ## ###### CHUNK OF HEIGHT #############
## A_color B_color C_color color ## XX_height height YY_height ZZ_height
## <lgl> <lgl> <dbl> <lgl> ## <lgl> <lgl> <lgl> <dbl>
## 1 NA NA 1 NA ## NA NA NA 15
Therefore, I want to coalesce the chunks to get:
## color height
## <dbl> <dbl>
## 1 1 15
Example
I've built a dumb function to simulate the structure of the data. Its internals don't matter, just the output.
library(stringi)
library(dplyr)
library(purrr)
generate_data <-
function(n_of_chunks, n_variants_in_each_chunk, nrows) {
# helper func for later
interweave_vecs <- function(x, y, n){
c(x, y)[order(c(ceiling(seq_along(x) / n), seq_along(y)))]}
# generate col names
names_core <- stri_rand_strings(n = n_of_chunks, length = 10)
names_expanded <- rep(names_core, each = n_variants_in_each_chunk)
prefixes <- replicate(expr = stri_rand_strings(n = 1, length = sample(1:3, 1)), n = length(names_expanded))
names_with_prefixes <- paste(prefixes, names_expanded, sep = "__")
names_final <- interweave_vecs(names_with_prefixes, names_core, n_variants_in_each_chunk)
# generate data
dat <-
rerun(
replicate(expr = sample(c(rep(NA, n_variants_in_each_chunk), runif(1))), n = n_of_chunks) %>%
as.vector() %>%
setNames(., names_final) %>%
bind_rows(),
.n = nrows
) %>%
bind_rows() %>%
cbind(., col_blah = rnorm(nrows), col_foo = letters[1:nrows])
# shuffle columns
dat[,sample(ncol(dat))] %>%
as_tibble()
}
Now simulating the data:
set.seed(2021)
my_df <- generate_data(n_of_chunks = 3, n_variants_in_each_chunk = 3, nrows = 10)
> my_df
## # A tibble: 10 x 14
## v__RmiNdhdGoy s13__1pbZoFVrxX Qx8__1pbZoFVrxX RmiNdhdGoy col_blah Xe__8xOGZuvswo `8xOGZuvswo` A__1pbZoFVrxX w__RmiNdhdGoy col_foo WM__8xOGZuvswo `1pbZoFVrxX` SNr__8xOGZuvswo tYD__RmiNdhdGoy
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NA NA -1.50 NA NA 0.846 0.763 a NA NA 0.0479 NA
## 2 0.0729 NA NA NA 2.20 0.365 NA NA NA b NA 0.995 NA NA
## 3 NA NA NA 0.398 0.939 NA NA NA NA c 0.695 0.234 NA NA
## 4 NA 0.541 NA NA 0.713 NA NA NA 0.235 d NA NA 0.470 NA
## 5 NA 0.317 NA NA -0.805 NA 0.163 NA NA e NA NA NA 0.506
## 6 NA NA 0.991 0.0519 0.791 NA 0.888 NA NA f NA NA NA NA
## 7 0.159 NA NA NA 0.116 NA NA 0.223 NA g NA NA 0.260 NA
## 8 0.442 0.0561 NA NA -1.33 NA NA NA NA h 0.663 NA NA NA
## 9 NA NA 0.346 0.479 -1.82 NA 0.193 NA NA i NA NA NA NA
## 10 NA NA 0.443 0.469 -0.403 NA NA NA NA j NA NA 0.450 NA
In my_df
, the columns are organized around 3 "stems":
- RmiNdhdGoy
- 1pbZoFVrxX
- 8xOGZuvswo
Each of which has 3 prefixed versions, and one version which is just the stem, so for example, the RmiNdhdGoy
chunk:
my_df %>%
select(ends_with("RmiNdhdGoy"))
## # A tibble: 10 x 4
## v__RmiNdhdGoy RmiNdhdGoy w__RmiNdhdGoy tYD__RmiNdhdGoy
## <dbl> <dbl> <dbl> <dbl>
## 1 NA NA 0.763 NA
## 2 0.0729 NA NA NA
## 3 NA 0.398 NA NA
## 4 NA NA 0.235 NA
## 5 NA NA NA 0.506
## 6 NA 0.0519 NA NA
## 7 0.159 NA NA NA
## 8 0.442 NA NA NA
## 9 NA 0.479 NA NA
## 10 NA 0.469 NA NA
- The prefix itself is alphanumeric in length between 1-3 characters.
- The separator between prefix and stem is two underscores (
__
)
I'm trying to come up with a way that would go over column names, and for the ones that have two underscores, take the substring after that separator, and look for other column names that end with that substring, then consider all those as a chunk of columns and coalesce them.
Desired output
I need to coalesce each chunk to get:
## # A tibble: 10 x 5
## RmiNdhdGoy `1pbZoFVrxX` `8xOGZuvswo` col_blah col_foo
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0.763 0.846 0.0479 -1.50 a
## 2 0.0729 0.995 0.365 2.20 b
## 3 0.398 0.234 0.695 0.939 c
## 4 0.235 0.541 0.470 0.713 d
## 5 0.506 0.317 0.163 -0.805 e
## 6 0.0519 0.991 0.888 0.791 f
## 7 0.159 0.223 0.260 0.116 g
## 8 0.442 0.0561 0.663 -1.33 h
## 9 0.479 0.346 0.193 -1.82 i
## 10 0.469 0.443 0.450 -0.403 j
I've seen a bunch of SO posts about this topic (e.g., this and this and this), but I don't understand how those solutions fit my situation. I hope someone would have an idea. Thanks!
EDIT
Following @tamtam's answer below, I'd like to emphasize that the solution I'm looking for is for any output of generate_data()
, and not specifically to my_df
. I've showed my_df
just to illustrate one exemplar data to tackle.