2

I am working with a set of binary data for months of the year similar to the sample tibble below but with tens of thousands of rows. I am trying to add a column to the existing table that sums each row between the month specified by the column Month and the final month Dec. For example, I would sum columns Nov:Dec = 0 for the first row, May:Dec = 8 for the second, Jan:Dec = 11 for the last, etc.

library(tidyverse)

df <- tribble(
  ~Jan,~Feb,~Mar,~Apr,~May,~Jun,~Jul,~Aug,~Sep,~Oct,~Nov,~Dec,~Month,~Expected,
  1,0,0,0,0,0,0,0,0,0,0,0,"Nov",0,
  0,1,1,1,1,1,1,1,1,1,1,1,"May",8,
  0,0,1,1,1,1,1,1,1,1,1,1,"Aug",5,
  0,0,0,1,1,1,1,1,1,1,1,1,"Jan",9,
  0,0,0,0,1,1,1,1,1,1,1,1,"Dec",1,
  0,0,0,0,0,1,1,1,1,1,1,1,"May",7,
  0,0,0,0,0,0,1,1,1,1,1,1,"Jun",6,
  0,0,0,0,0,0,0,1,1,1,1,1,"Jul",5,
  0,0,0,0,0,0,0,0,1,1,1,1,"Feb",4,
  0,0,0,0,0,0,0,0,0,1,1,1,"Oct",3,
  0,0,0,0,0,0,0,0,0,0,1,1,"Mar",2,
  1,1,1,1,1,1,1,1,1,1,1,0,"Jan",11,
)

I tried using the mutate() and sum() functions in a row-wise manner with a dynamic column range based on the value of Month at each respective row n(). However, only the first row value of Month (Nov) gets used--so, instead of creating a new column with the "Expected" sums shown in df, I get the sum of the column range Nov:Dec for each row.

test <- df |> 
  rowwise() |> 
  mutate(
    Actual = sum(c_across(df$Month[n()]:Dec))
  )

> test
# A tibble: 12 × 15
# Rowwise: 
     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec Month Expected Actual
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>    <dbl>  <dbl>
 1     1     0     0     0     0     0     0     0     0     0     0     0 Nov          0      0
 2     0     1     1     1     1     1     1     1     1     1     1     1 May          8      2
 3     0     0     1     1     1     1     1     1     1     1     1     1 Aug          5      2
 4     0     0     0     1     1     1     1     1     1     1     1     1 Jan          9      2
 5     0     0     0     0     1     1     1     1     1     1     1     1 Dec          1      2
 6     0     0     0     0     0     1     1     1     1     1     1     1 May          7      2
 7     0     0     0     0     0     0     1     1     1     1     1     1 Jun          6      2
 8     0     0     0     0     0     0     0     1     1     1     1     1 Jul          5      2
 9     0     0     0     0     0     0     0     0     1     1     1     1 Feb          4      2
10     0     0     0     0     0     0     0     0     0     1     1     1 Oct          3      2
11     0     0     0     0     0     0     0     0     0     0     1     1 Mar          2      2
12     1     1     1     1     1     1     1     1     1     1     1     0 Jan         11      1

I've reviewed other posts with a similar albeit not the same aim (1, 2, 3), but I am still struggling to find a solution. Any help or suggestions would be much appreciated!

Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
Brad
  • 21
  • 1
  • 1
    For the first post, this is very well written and also an interesting question. – Darren Tsai Aug 17 '23 at 12:10
  • Hi, if any answers have solved your question, you could consider giving upvotes and picking one answer you prefer as accepted by clicking the check mark. Thanks! – Darren Tsai Aug 18 '23 at 05:55

2 Answers2

2

You can convert Month to a ordered factor, and row-sum those columns whose names are greater than or equal to Month.

library(dplyr)

df %>%
  mutate(Month = ordered(Month, levels = month.abb),
         Result = rowSums(across(Jan:Dec, ~ ifelse(cur_column() >= Month, .x, 0))))

# # A tibble: 12 × 15
#      Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec Month Expected Result
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <ord>    <dbl>  <dbl>
#  1     1     0     0     0     0     0     0     0     0     0     0     0 Nov          0      0
#  2     0     1     1     1     1     1     1     1     1     1     1     1 May          8      8
#  3     0     0     1     1     1     1     1     1     1     1     1     1 Aug          5      5
#  4     0     0     0     1     1     1     1     1     1     1     1     1 Jan          9      9
#  5     0     0     0     0     1     1     1     1     1     1     1     1 Dec          1      1
#  6     0     0     0     0     0     1     1     1     1     1     1     1 May          7      7
#  7     0     0     0     0     0     0     1     1     1     1     1     1 Jun          6      6
#  8     0     0     0     0     0     0     0     1     1     1     1     1 Jul          5      5
#  9     0     0     0     0     0     0     0     0     1     1     1     1 Feb          4      4
# 10     0     0     0     0     0     0     0     0     0     1     1     1 Oct          3      3
# 11     0     0     0     0     0     0     0     0     0     0     1     1 Mar          2      2
# 12     1     1     1     1     1     1     1     1     1     1     1     0 Jan         11     11

Note: You must use ordered() instead of factor() here because >= is not meaningful for ordinary factors.


Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
1

An approach using purrr::map and pick for variable selection

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(Exp = purrr::map_dbl(Month, ~ sum(pick(.x:Dec)))) %>% 
  ungroup()

output

# A tibble: 12 × 15
     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec Month
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
 1     1     0     0     0     0     0     0     0     0     0     0     0 Nov  
 2     0     1     1     1     1     1     1     1     1     1     1     1 May  
 3     0     0     1     1     1     1     1     1     1     1     1     1 Aug  
 4     0     0     0     1     1     1     1     1     1     1     1     1 Jan  
 5     0     0     0     0     1     1     1     1     1     1     1     1 Dec  
 6     0     0     0     0     0     1     1     1     1     1     1     1 May  
 7     0     0     0     0     0     0     1     1     1     1     1     1 Jun  
 8     0     0     0     0     0     0     0     1     1     1     1     1 Jul  
 9     0     0     0     0     0     0     0     0     1     1     1     1 Feb  
10     0     0     0     0     0     0     0     0     0     1     1     1 Oct  
11     0     0     0     0     0     0     0     0     0     0     1     1 Mar  
12     1     1     1     1     1     1     1     1     1     1     1     0 Jan  
   Expected   Exp
      <dbl> <dbl>
 1        0     0
 2        8     8
 3        5     5
 4        9     9
 5        1     1
 6        7     7
 7        6     6
 8        5     5
 9        4     4
10        3     3
11        2     2
12       11    11
Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
  • This is a good idea!(+1) But I think it's not feasible in practice because `rowwise` + `map` is time-consuming if the dataset is slightly larger. – Darren Tsai Aug 18 '23 at 06:16
  • @DarrenTsai I agree, it’s definitely slower than approaches without rowwise, but might still be fast enough in practice. For me it served as a kind of proof of principle, if it’s possible to get to a solution without using too much workarounds, just working with the given variables. – Andre Wildberg Aug 18 '23 at 10:13