How to sum up specific cells of a column based on conditions from another column in R?

Question

I have a small problem. I am working on a data frame. It is as follows:

df1
Duration    Intensity
NA             NA
10           0.1016
10           0.0254
NA             NA
NA             NA
10           0.0508
10           0.0508
10           0.1016
NA             NA
10           0.0254

I want to calculate the cumulative sum of “Intensity” for each event of 10 min “Duration”. In other words, I want to sum the values of “Intensity” that lie between the ‘NA’ values in the “Intensity” column. The output should look like this:

df2

Duration    Intensity   Intensity_sum
NA             NA   
10           0.1016 
10           0.0254        0.127
NA             NA   
NA             NA   
10           0.0508 
10           0.0508 
10           0.1016        0.2032
NA             NA   
10           0.0254        0.0254

I tried the following code:

library(dplyr)    
df2 <- as.data.frame(mutate(df1,Intensity_sum = with(df1,Duration==10,cumsum(Intensity))))

But I am only receiving TRUE or FALSE results, not the values.

Possible duplicate of [How to sum a variable by group?](http://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group) — nrussell, Dec 30 '16 at 18:03
@nrussell kind of, except that the OP would have to first construct the group ID, and also to "merge" on the aggregated sum to the final observation of each group. — lmo, Dec 30 '16 at 18:05
@nrussell Thanks, but my problem is not exactly the same as that one — happy123, Dec 30 '16 at 18:22

akrun · Accepted Answer · 2016-12-30T18:11:14.170

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), group by the run-length ID of the logical vector (!is.na(Duration)), assign (:=) the sum of 'Intensity' as 'Intensity_sum', and then replace the duplicated elements with 'NA'.

library(data.table)
setDT(df1)[, Intensity_sum := sum(Intensity), rleid(!is.na(Duration))]
df1[duplicated(Intensity_sum, fromLast=TRUE), Intensity_sum := NA]
df1
#     Duration Intensity Intensity_sum
# 1:       NA        NA            NA
# 2:       10    0.1016            NA
# 3:       10    0.0254        0.1270
# 4:       NA        NA            NA
# 5:       NA        NA            NA
# 6:       10    0.0508            NA
# 7:       10    0.0508            NA
# 8:       10    0.1016        0.2032
# 9:       NA        NA            NA
#10:       10    0.0254        0.0254

Or, in a single line, we can group by rleid (as described above) and build the new column by repeating NA for every row of the group except the last, which gets the sum of 'Intensity':

setDT(df1)[, Intensity_sum := rep(c(NA,sum(Intensity)), c(.N-1, 1)), rleid(!is.na(Duration))]
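
For anyone without data.table, the same grouping idea can be sketched in base R. This is only an illustration under assumptions: df1 is rebuilt from the values in the question, rle() stands in for rleid(), and run_id is a name made up for the sketch.

```r
# Rebuild the question's data (values copied from the post).
df1 <- data.frame(
  Duration  = c(NA, 10, 10, NA, NA, 10, 10, 10, NA, 10),
  Intensity = c(NA, 0.1016, 0.0254, NA, NA, 0.0508, 0.0508, 0.1016, NA, 0.0254)
)

# Base R stand-in for rleid(): give each run of identical values its own ID.
runs   <- rle(!is.na(df1$Duration))
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# [1] 1 2 2 3 3 4 4 4 5 6

# Sum within each run and keep the total only on the last row of the run.
# Runs made of NA rows sum to NA, so they stay NA throughout.
df1$Intensity_sum <- ave(df1$Intensity, run_id, FUN = function(x) {
  c(rep(NA_real_, length(x) - 1), sum(x))
})
df1
```

Because the NA is placed by position within each run rather than by comparing sums, two events that happen to have the same total do not interfere with each other.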

lmo · Answer 2 · 2017-01-02T15:59:00.963

Using the fourth data.table method I suggest below, it is possible to achieve the desired result in base R. This method uses ave and replace.

ave(df$Intensity, cumsum(is.na(df$Intensity)),
    FUN=function(x) replace(NA, (seq_along(x) == length(x) & length(x) > 1),
                            sum(x, na.rm=TRUE)))
 [1]     NA     NA 0.1270     NA     NA     NA     NA 0.2032     NA 0.0254

Here are some additional data.table options

df[, val := ifelse(is.na(shift(Intensity, type="lead")),
   sum(Intensity, na.rm=TRUE), NA), by=cumsum(is.na(Intensity))]

In this instance, grouping is performed with cumsum, and the sum is placed with ifelse, which uses shift to check for NA boundaries.

This is a bit slower and will return 0 instead of NA for a group that consists of a single NA row, that is, an NA immediately followed by another NA (row 4 in the results at the end of this answer). The first argument of ifelse can be modified so this doesn't happen.
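
To see where that 0 comes from, it may help to look at the group IDs and the per-group sums on their own. This is a base R sketch using the Intensity values from the question; grp and group_sums are illustrative names, not part of the code above.

```r
Intensity <- c(NA, 0.1016, 0.0254, NA, NA, 0.0508, 0.0508, 0.1016, NA, 0.0254)

# Every NA starts a new group, so the counter increases at each NA.
grp <- cumsum(is.na(Intensity))
grp
# [1] 1 1 1 2 3 3 3 3 4 4

# Group 2 holds only the NA in row 4; with na.rm = TRUE its sum is 0.
group_sums <- tapply(Intensity, grp, sum, na.rm = TRUE)
group_sums
```

Group 2 is the case that needs special handling: summing an all-NA group with na.rm=TRUE yields 0, not NA.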

Second, a faster solution that uses c and subsetting instead of ifelse. I added logic so that single rows of NA will receive an NA rather than 0.

df[, val2 := c(NA, sum(Intensity, na.rm=TRUE))[(seq_len(.N) == .N & .N > 1) + 1],
   by=cumsum(is.na(Intensity))]

Revising this method slightly to take advantage of a favorite manipulation of NA, we get the following.

df[, val3 := sum(Intensity, na.rm=TRUE) * NA^(seq_len(.N) != .N | .N == 1),
   by=cumsum(is.na(Intensity))]

This method takes advantage of the fact that NA^x returns NA when x != 0, whereas NA^0 returns 1, and NA^FALSE is equivalent to NA^0.
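
A few lines at the console show the behavior in isolation. This is just a sketch; total and mask are illustrative names, and 0.127 is the first event's sum from the question.

```r
NA^0        # 1
NA^FALSE    # 1, since FALSE is coerced to 0
NA^1        # NA
NA^TRUE     # NA, since TRUE is coerced to 1

# TRUE marks the rows that should end up as NA, FALSE the row that keeps the sum.
mask  <- NA^c(TRUE, TRUE, FALSE)
total <- 0.127
total * mask
# [1]    NA    NA 0.127
```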

A fourth alternative is to use replace. This function replaces the values of a vector at the specified indices.

df[, val4 := replace(NA, (seq_len(.N) == .N & .N != 1),
                     sum(Intensity, na.rm=TRUE)),
   by=cumsum(is.na(Intensity))]
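
As a small illustration of what replace does here (a sketch with made-up names: is_last stands in for the logical index and 0.127 for the group sum), starting from a single NA and indexing with a longer logical vector stretches the result to the length of the index:

```r
# TRUE only at the last position of a three-row group.
is_last <- c(FALSE, FALSE, TRUE)

# The single NA is extended to length 3 and the TRUE slot receives the value.
out <- replace(NA, is_last, 0.127)
out
# [1]    NA    NA 0.127
```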

The results of the four methods are shown below.

df
    Duration Intensity   val3    val   val2   val4
 1:       NA        NA     NA     NA     NA     NA
 2:       10    0.1016     NA     NA     NA     NA
 3:       10    0.0254 0.1270 0.1270 0.1270 0.1270
 4:       NA        NA     NA 0.0000     NA     NA
 5:       NA        NA     NA     NA     NA     NA
 6:       10    0.0508     NA     NA     NA     NA
 7:       10    0.0508     NA     NA     NA     NA
 8:       10    0.1016 0.2032 0.2032 0.2032 0.2032
 9:       NA        NA     NA     NA     NA     NA
10:       10    0.0254 0.0254 0.0254 0.0254 0.0254
