Apply a function to a subset of many columns in R

Question

How do I apply a function to many columns of grouped rows? For example;

library(tidyverse)
data <- tribble(
  ~Date,      ~Seq1, ~Component, ~Seq2,  ~X1,  ~X2,   ~X3,   
  "01/01/18", 1,     "Smooth",   NA,     3.98,  2.75,  1.82, 
  "01/01/18", 2,     "Smooth",   NA,     1.02,  0.02, -0.04, 
  "01/01/18", 3,     "Smooth",   NA,     3.48,  3.06,  1.25, 
  "01/01/18", 3,     "Bounce",   1,      2.01, -0.43, -0.52, 
  "01/01/18", 3,     "Bounce",   2,      1.94,  1.53,  1.92) %>%
mutate_at(vars(Date, Seq1, Component, Seq2), funs(factor))

Each column of X values (many more columns, truncated here for clarity) is grouped into Date, Seq1, Component, and Seq2. While Component "Smooth" and Seq1 "NA" are constant, within Component "Bounce" level there are multiple Seq2 levels e.g. "1", "2", etc.

How do I sum each X column, always the constant "NA" with each level of Seq2?

The desired results is:

expected <- tribble(
~Date,      ~Seq1, ~Component, ~Seq2,  ~X1,  ~X2,   ~X3,   
"01/01/18", 1,     "Smooth",   NA,     3.98,  2.75,  1.82, 
"01/01/18", 2,     "Smooth",   NA,     1.02,  0.02, -0.04, 
"01/01/18", 3,     "Smooth",   NA,     3.48,  3.06,  1.25, 
"01/01/18", 3,     "Bounce",   1,      5.49,  3.49,  1.77, 
"01/01/18", 3,     "Bounce",   2,      5.42,  4.59,  3.17)

The following example only adds each Seq1 level.

data %>% 
  group_by(Date, Seq1) %>%
  mutate_at(vars(starts_with("X")), funs(sum(.)))
#> # A tibble: 5 x 7
#> # Groups:   Date, Seq1 [3]
#>   Date     Seq1  Component  Seq2    X1    X2    X3
#>   <fct>    <fct> <fct>     <fct> <dbl> <dbl> <dbl>
#> 1 01/01/18 1     Smooth    <NA>   3.98  2.75  1.82
#> 2 01/01/18 2     Smooth    <NA>   1.02  0.02 -0.04
#> 3 01/01/18 3     Smooth    <NA>   7.43  4.16  2.65
#> 4 01/01/18 3     Bounce    1      7.43  4.16  2.65
#> 5 01/01/18 3     Bounce    2      7.43  4.16  2.65

I am certain there is solution within the purrr or apply function family, however, I have been unsuccessful (for days) in solving this example. The actual data has about 180 X columns, with hundreds of Date and Seq1 combinations, and multiple Seq2 levels.

A similar example could be Summing Multiple Groups of Columns, How to apply a function to a subset of columns in r?, or even perhaps https://github.com/jennybc/row-oriented-workflows.

^{Created on 2018-10-23 by the reprex package (v0.2.1)}

Can you provide your expected output from this data? I do not understand what "How do I sum each X column, always the constant "NA" with each level of Seq2?" means. Why is your attempt not correct? — Calum You, Oct 23 '18 at 03:47
Perhaps `data %>% mutate(Sum = rowSums(.[grep("^X\\d+", names(.))]))` — akrun, Oct 23 '18 at 03:49
My attempt is incorrect because it sums each column only at the Seq1 level e.g. X1 row 3-5 are the same, rather than in the desired results. — statoconial membrane, Oct 23 '18 at 04:16
It is still unclear to me how do you get to `5.49, 3.49, 1.77` and `5.42, 4.59, 3.17` — deann, Oct 27 '18 at 11:26
@deann it is through [row, column]: `[3, X1] + [4, X1], [3, X2] + [4, X2], [3, X3] + [4, X3]` and `[3, X1] + [5, X1], [3, X2] + [5, X2], [3, X3] + [5, X3]`. Note how, for each X column, Component == "smooth" is always added to each Component == "Bounce". That is, when there is a Bounce component (per sequence in Seq2) add the Smooth component. Later, the X1, X2, X3, etc. will be used to plot a series of lines. — statoconial membrane, Oct 29 '18 at 06:54
Really tough to understand the meaning of the question, as @CalumYou observes. — Nettle, Oct 30 '18 at 02:31
Is there always only one `Smooth` for each `Seq1`, but many potential `Bounce`? — Calum You, Oct 30 '18 at 02:40
@CalumYou correct; there can be 0-4 potential `Bounce` components for each `Seq1`. Up to 4 is the maximum in this data set, however there could be more in other data sets so I don't think that should be hard-coded. I apologise it is not easier to understand… what more can I explain? I am unsure of how to explain the problem. — statoconial membrane, Oct 31 '18 at 02:51

score 0 · Accepted Answer · answered Oct 31 '18 at 08:00

Here's my solution. This problem is not really a purrr task, because there is nothing really that you want to map a single function to. Instead, what I understand the problem to be is that you want to match each X value in a Bounce row with the corresponding Smooth row X values of the same Date and Seq1 (and there is only one such row). This means that it is really a merging or joining problem, and then the approach is to set up the join so that you can match the right values and do the sum. So I go as follows:

Split the data into the Smooth rows and the Bounce rows and gather so that all the X values are in one column
Join the smooths onto the bounces with a left_join, so each original Bounce row now has its corresponding Smooth.
mutate the sum into a new column and select/rename the columns to be as in the original
bind_rows to join the newly summed bounces and spread to return to the original layout.

This should be robust to any number of Date, Seq1, Seq2 and X values.

library(tidyverse)
data <- tribble(
  ~Date,      ~Seq1, ~Component, ~Seq2,  ~X1,  ~X2,   ~X3,   
  "01/01/18", 1,     "Smooth",   NA,     3.98,  2.75,  1.82, 
  "01/01/18", 2,     "Smooth",   NA,     1.02,  0.02, -0.04, 
  "01/01/18", 3,     "Smooth",   NA,     3.48,  3.06,  1.25, 
  "01/01/18", 3,     "Bounce",   1,      2.01, -0.43, -0.52, 
  "01/01/18", 3,     "Bounce",   2,      1.94,  1.53,  1.92)

smooths <- data %>%
  filter(Component == "Smooth") %>%
  gather(X, val, starts_with("X"))

bounces <- data %>%
  filter(Component == "Bounce") %>%
  gather(X, val, starts_with("X")) %>%
  left_join(smooths, by = c("Date", "Seq1", "X")) %>%
  mutate(val = val.x + val.y) %>%
  select(Date, Seq1, Component = Component.x, Seq2 = Seq2.x, X, val)

bounces %>%
  bind_rows(smooths) %>%
  spread(X, val)
#> # A tibble: 5 x 7
#>   Date      Seq1 Component  Seq2    X1    X2    X3
#>   <chr>    <dbl> <chr>     <dbl> <dbl> <dbl> <dbl>
#> 1 01/01/18     1 Smooth       NA  3.98  2.75  1.82
#> 2 01/01/18     2 Smooth       NA  1.02  0.02 -0.04
#> 3 01/01/18     3 Bounce        1  5.49  2.63  0.73
#> 4 01/01/18     3 Bounce        2  5.42  4.59  3.17
#> 5 01/01/18     3 Smooth       NA  3.48  3.06  1.25

^{Created on 2018-10-31 by the reprex package (v0.2.1)}

Great! That works as expected. Although, the last `spread` doesn't seem to respect the key order, so when there is many X columns then the newly spread column order becomes X1, X10, X11, etc. rather than X1, X2, X3... X10, X11. — statoconial membrane, Nov 01 '18 at 00:08
This X sequence mis-arrangement was solved through with `select(Date, Seq1, Component, Seq2, paste("X", seq(1, ncol(select(., starts_with("X"))), 1), sep=""))` — statoconial membrane, Nov 01 '18 at 00:25

Apply a function to a subset of many columns in R

1 Answers1