
I have count data arranged by quarters that looks something like this. Let's say each count is the number of car crashes in that quarter.

df <- structure(list(quarter = c("Q4", "Q5", "Q8", "Q6", "Q3", "Q7"), column = c(6, 7, 10, 8, 5, 9), Q1 = c(8L, 7L, 4L, 10L, 9L, 6L), Q2 = c(9L, 6L, 7L, 8L, 5L, 4L), Q3 = c(15L, 17L, 14L, 20L, 19L, 16L), Q4 = c(25L, 21L, 24L, 23L, 20L, 22L), Q5 = c(20L, 22L, 16L, 18L, 17L, 21L), Q6 = c(15L, 13L, 12L, 16L, 10L, 11L), Q7 = c(9L, 7L, 11L, 6L, 5L, 8L), Q8 = c(23L, 24L, 26L, 22L, 27L, 25L)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L), class = "data.frame", .Names = c("quarter", "endcolumn", "Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8"))

  quarter endcolumn Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
1      Q4         6  8  9 15 25 20 15  9 23
2      Q5         7  7  6 17 21 22 13  7 24
3      Q8        10  4  7 14 24 16 12 11 26
4      Q6         8 10  8 20 23 18 16  6 22
5      Q3         5  9  5 19 20 17 10  5 27
6      Q7         9  6  4 16 22 21 11  8 25

For each row, I want a cumulative sum of total car accidents up to a certain point in time (t0), as indicated by the quarter variable, e.g. for row 1 I want to know the total number of car crashes from Q1 to Q4. The resulting df should look something like this.

  quarter endcolumn Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 t0
1      Q4         6  8  9 15 25 20 15  9 23 57
2      Q5         7  7  6 17 21 22 13  7 24 73
3      Q8        10  4  7 14 24 16 12 11 26 114
4      Q6         8 10  8 20 23 18 16  6 22 95
5      Q3         5  9  5 19 20 17 10  5 27 33
6      Q7         9  6  4 16 22 21 11  8 25 88

Note that the range for the sums varies over the rows. To make the above table, I translated the 'quarter' values into the variable 'endcolumn' (e.g. Q4 = column 6), and then used these values as below:

df$t0 <- NA 
df[1, 11] <- rowSums(df[1, 3:6])
df[2, 11] <- rowSums(df[2, 3:7])
df[3, 11] <- rowSums(df[3, 3:10])
df[4, 11] <- rowSums(df[4, 3:8])
df[5, 11] <- rowSums(df[5, 3:5])
df[6, 11] <- rowSums(df[6, 3:9])

As my actual dataset is much larger than the one here, ideally I'd write a function that takes the values from endcolumn and uses them as the endpoint for rowSums.

There's a lot of help on passing variable names to functions (e.g. here and here), but something like rowSums(df[ , 3:which(names(df) == "endcolumn")]) just adds the endcolumn values to the values in Q1.
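A small sketch of why that attempt misbehaves, using a hypothetical two-row cut of the data: which(names(df) == "endcolumn") returns the position of the column called endcolumn (2), not the values stored in it, so the range becomes 3:2.

```r
# Hypothetical two-row, four-column cut of the example data
df <- data.frame(quarter = c("Q4", "Q5"), endcolumn = c(6, 7),
                 Q1 = c(8, 7), Q2 = c(9, 6))

end <- which(names(df) == "endcolumn")  # position of the column, not its contents
end                    # 2
3:end                  # 3 2 (a descending sequence)
rowSums(df[, 3:end])   # Q1 + endcolumn for each row: 14 14
```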

This is the closest I can find to my question. However, when I try that approach, it still doesn't let me vary the range of columns summed in each row (i.e. 3:6 for the first row, 3:7 for the second).

Is there a way to pass the values from endcolumn to a function using rowSums?

NB: I realise a second option might be to use the values in the quarter variable (e.g. Q4, Q5) and use them to identify the endpoint column. The reason I haven't outlined my question in this way is because once I write this function for t0, I also want to be able to find sums for a year before and year after t0, and I think it will be easier to tweak a function by using some variation of "3: endcolumn +/- 4" rather than trying to change Q9 to Q5 or Q13. But if there's an easy way to do the latter, that would also be helpful.
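For the label-based option, a minimal sketch, assuming every quarter label is spelled exactly like a column name: match() turns the labels into column positions, and a shifted label can be built by doing arithmetic on the numeric part. The names end_from_label, q_num and label_minus1 are made up for illustration.

```r
df <- data.frame(
  quarter   = c("Q4", "Q5", "Q8", "Q6", "Q3", "Q7"),
  endcolumn = c(6, 7, 10, 8, 5, 9),
  Q1 = c(8, 7, 4, 10, 9, 6),       Q2 = c(9, 6, 7, 8, 5, 4),
  Q3 = c(15, 17, 14, 20, 19, 16),  Q4 = c(25, 21, 24, 23, 20, 22),
  Q5 = c(20, 22, 16, 18, 17, 21),  Q6 = c(15, 13, 12, 16, 10, 11),
  Q7 = c(9, 7, 11, 6, 5, 8),       Q8 = c(23, 24, 26, 22, 27, 25)
)

# Position of each row's quarter label among the column names
end_from_label <- match(df$quarter, names(df))
all(end_from_label == df$endcolumn)   # TRUE for this sample

# One year earlier, computed on the label rather than the index (Q5 -> Q1, Q8 -> Q4)
q_num        <- as.integer(sub("^Q", "", df$quarter))
label_minus1 <- paste0("Q", q_num - 4)
match(label_minus1, names(df))        # NA where the shifted quarter is not a column
```

Shifting the index (endcolumn - 4) and shifting the label give the same positions whenever the shifted quarter exists; the label route returns NA instead of an out-of-range index when it does not.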

al7

2 Answers


Not the neatest, but this should work:

for (i in seq_len(nrow(df))) {
  df$t0[i] <- sum(df[i, 3:df$endcolumn[i]])
}

That is, for each row, take the column index stored in endcolumn and sum the columns from the 3rd up to that index, then put the sum in that row's t0 cell.
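For a much larger data frame, one loop-free alternative (a sketch, not part of the loop above) is to zero out the cells past each row's endpoint and then call rowSums once. It assumes the Q columns are contiguous and in order, and that endcolumn holds a column position in df; q_cols, m and keep are illustrative names.

```r
df <- data.frame(
  quarter   = c("Q4", "Q5", "Q8", "Q6", "Q3", "Q7"),
  endcolumn = c(6, 7, 10, 8, 5, 9),
  Q1 = c(8, 7, 4, 10, 9, 6),       Q2 = c(9, 6, 7, 8, 5, 4),
  Q3 = c(15, 17, 14, 20, 19, 16),  Q4 = c(25, 21, 24, 23, 20, 22),
  Q5 = c(20, 22, 16, 18, 17, 21),  Q6 = c(15, 13, 12, 16, 10, 11),
  Q7 = c(9, 7, 11, 6, 5, 8),       Q8 = c(23, 24, 26, 22, 27, 25)
)

q_cols <- grep("^Q[0-9]+$", names(df))   # positions of Q1..Q8 in df (3:10 here)
m      <- as.matrix(df[q_cols])          # numeric matrix of the counts

# TRUE where a cell's column position in df is at or before that row's endcolumn;
# df$endcolumn is recycled down the rows, so row i is compared with endcolumn[i]
keep <- (col(m) + min(q_cols) - 1) <= df$endcolumn

df$t0 <- rowSums(m * keep)               # cells past the endpoint contribute 0
df$t0                                    # 57 73 114 95 33 88
```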

Spätzle
  • Now for the latter part of your question: if you define another column, say **startcolumn**, you can sum each row over a different range as you wish. Simply replace the argument inside `sum` with `df[i, df$startcolumn[i]:df$endcolumn[i]]` – Spätzle Apr 17 '16 at 15:53
  • You can also easily refer in each row to the previous/next one. It is really straightforward. – Spätzle Apr 17 '16 at 15:55
  • Thanks! This works really well. I wouldn't really be interested in changing the start column (only the end column +/- 4), but I assume then I could create different variables (tnendcolumn) to take the index from each time. For example: t0endcolumn <- c(5, 6, 9, 7, 4, 8); t-1endcolumn <- ifelse(t0endcolumn - 4 > 3, t0endcolumn - 4, 3); t1endcolumn <- ifelse(t0endcolumn + 4 < 10, t0endcolumn + 4, 10). And then use for(i in 1:nrow(df)){df$t0[i] <- sum(df[i, 3:df$tnendcolumn[i]])}. Or is there a better way to do this? – al7 Apr 17 '16 at 16:17
  • I'm not sure what you are trying to do in this code, could you explain? – Spätzle Apr 17 '16 at 16:52
  • I want to be able to calculate the sums using a changing endcolumn, not start column. So if the original sum (t0) is for the range 3:9, and I want to know the sum for a year (i.e. four quarters) earlier, I want to have a sum (t-1) for the range 3:5. I think the easiest way to do this is to define another end column (t-1endcolumn), and use this as the index to create t-1. Or can I use something like `for(i in 1:nrow(df)){df$t-1[i] <- sum(df[i, 3:(df$endcolumn[i]-4)])}` – al7 Apr 17 '16 at 17:16
  • I'd stick to another column, since it gives you greater freedom than a constant subtraction of 4. Either way I suggest you put a safeguard on the summing range to prevent costly mistakes: `sum(df[i, 3:(max(3,(df$endcolumn[i]-4)))])` or `sum(df[i, 3:(max(3,(df$endcolumn[i]-df$t-1endcolumn[i])))])`, you get the essence of the idea. – Spätzle Apr 18 '16 at 06:07
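The year-before / year-after idea from these comments could be wrapped in one helper. This is only a sketch: sum_to_offset is a hypothetical name, it locates the Q columns by a name pattern, and it returns NA when the shifted endpoint falls before Q1 (a different choice from the max(3, ...) guard, which would return Q1 alone). Endpoints past the last quarter are truncated to the data available.

```r
df <- data.frame(
  quarter   = c("Q4", "Q5", "Q8", "Q6", "Q3", "Q7"),
  endcolumn = c(6, 7, 10, 8, 5, 9),
  Q1 = c(8, 7, 4, 10, 9, 6),       Q2 = c(9, 6, 7, 8, 5, 4),
  Q3 = c(15, 17, 14, 20, 19, 16),  Q4 = c(25, 21, 24, 23, 20, 22),
  Q5 = c(20, 22, 16, 18, 17, 21),  Q6 = c(15, 13, 12, 16, 10, 11),
  Q7 = c(9, 7, 11, 6, 5, 8),       Q8 = c(23, 24, 26, 22, 27, 25)
)

# Hypothetical helper: sum each row from the first Q column up to endcolumn + offset
sum_to_offset <- function(data, offset = 0) {
  q_cols <- grep("^Q[0-9]+$", names(data))   # assumes quarter columns are named Q<number>
  first  <- min(q_cols)
  last   <- max(q_cols)
  end    <- data$endcolumn + offset
  vapply(seq_len(nrow(data)), function(i) {
    if (end[i] < first) return(NA_real_)     # window ends before Q1: nothing to sum
    sum(data[i, first:min(end[i], last)])    # truncate at the last available quarter
  }, numeric(1))
}

df$t0      <- sum_to_offset(df)        # 57 73 114 95 33 88
df$t_minus <- sum_to_offset(df, -4)    # NA 7 49 18 NA 26
df$t_plus  <- sum_to_offset(df, 4)     # 124 117 114 123 85 113
```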

I would suggest doing this with the tidyr and dplyr packages instead. (Note that in your question you seem to have the wrong t0 results for most rows, because you always take the row sum from the first row, rowSums(df[1, ...]), even though the quarter value differs in each row; although if these are quarterly car crashes, I'm not quite sure how that's possible.) Anyway, here's an example that sums up to a given quarter, using the pipe operator %>% to improve readability. If you're unfamiliar with tidyr and dplyr, I highly recommend checking out the vignettes for those packages.

library(dplyr)
library(tidyr)
# take your data frame
df %>% 
  # gather into long format
  gather(Q_column, value, -quarter, -endcolumn) %>% 
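  # extract the quarter number from the column name; convert = TRUE makes it
  # an integer, so the comparison below is numeric rather than string-based
  extract(Q_column, "column", "Q(\\d+)", convert = TRUE) %>% 
  # group by the quarter and end column for calculations
  group_by(quarter, endcolumn) %>% 
  # sum up to the desired quarter (Q1 sits in column 3, so quarter number = endcolumn - 2)
  summarize(t0 = sum(value[column <= endcolumn - 2]))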
sebkopf
  • Oops, I edited my example to fix this. I'm familiar with tidyr and dplyr, but didn't think of restructuring the data like this. This definitely works, but Spätzle's answer is easier to manipulate. Thanks anyway! – al7 Apr 17 '16 at 16:16