How to count cumulative non-zero variables in a very large data frame without for loop in R?

Question

I have a huge df and trying to execute follow code takes too long. Anyway to speed it up?

df$col2 <- 0
for (i in 2:nrow(df)) {
    if (df$col1 > 0) {
        df$col2[i] <- df$col2[i-1] + 1
    }
    else {
        df$col2[i] <- 0
    }
}

Example data

df <- data.frame(col1 = c(1, 0, 10, 28, 0, 0, 2))

Expected result

I am trying to use col2 to count cumulative non-zero variables in col1 and reset the count when I hit a 0 in col1.

You say you have a huge dataframe. How many rows and columns? — Uwe, Feb 08 '17 at 09:12

Uwe · Answer 1 · 2017-02-08T11:49:02.033

Here is a solution that uses the rleid() function from the data.table package:

library(data.table)
setDT(df)[, .(col1, col2 = cumsum(col1 != 0)), by = rleid(col1 != 0)][, rleid := NULL][]
#   col1 col2
#1:    1    1
#2:    0    0
#3:   10    1
#4:   28    2
#5:    0    0
#6:    0    0
#7:    2    1

The rleid() function is a convenience function for generating a run-length type id column to be used in grouping operations (?rleid). It is applied to the sequence of logical values created by col1 != 0 which distinghuishes zero and non-zero values. Within each group, cumsum() is used to count the non-zero values. Finally, the rleidcolumn is removed from the result.

As alternative, cumsum() could be replaced by a simple sequence.

setDT(df)[, .(col1, col2 = seq_len(.N)), by = rleid(col1 != 0)][col1 == 0, col2 := 0][
  , rleid := NULL][]

However, this also counts subsequent zero values which wasn't requested. So, these counts in col2 have to be reset to zero for all rows where col1 is zero.

score 0 · Answer 2 · answered Feb 04 '17 at 00:56

0

I'll annotate what I think are the errors and show what I think would be a faster equivalent:

df$col2 <- 0
for (i in 2:nrow(df)) { 
    if (df$col1 > 0) {   # at least a "semantic" error
        df$col2[i] <- df$col2[i-1] + 1}
    else { df$col2[i] <- 0 }
}

Asking for if (df$col1 > 0) will be comparing a vector to a single value without indexing which of the multiple values should be tested. Hence, calling it a "semantic error" because R will only test the first value of df$col1 in each pass through the loop. I'm guessing you want something along these lines:

df$col2 <- cumsum( c(0, head(df$col1, -1) > 0 ) )
df$col2[ c(TRUE, head(df$col1 == 0)]  <- 0

That will construct the running count you asked for and then zero out the entries where the was no increment.

answered Feb 04 '17 at 00:56

IRTFM

258,963
21
364
487

Thank you for point out the semantic error. What I meant to type was df$col1[i] >0 – BillyH Feb 04 '17 at 01:28
@42- It seems there's a typo in your second line. (and the OP has clarified his question) – Scarabee Feb 05 '17 at 00:02
@42- It seems the running count is not reset when a zero is hit in `col1` as the OP had requested in his clarifying comment. – Uwe Feb 08 '17 at 08:16

score 0 · Answer 3 · edited May 23 '17 at 12:33

0

There's a fast (but tricky) solution in this answer.

So in your case:

df <- data.frame(col1 = c(1, 0, 10, 28, 0, 0, 2))

f7 <- function(x) { tmp <- cumsum(x) ; tmp - cummax((!x)*tmp) }

df$col2 <- f7(df$col1 > 0)

df
#   col1 col2
# 1    1    1
# 2    0    0
# 3   10    1
# 4   28    2
# 5    0    0
# 6    0    0
# 7    2    1

edited May 23 '17 at 12:33

Community

1
1

answered Feb 04 '17 at 23:57

Scarabee

5,437
5
29
55

Thank. This helps a lot. – BillyH Feb 05 '17 at 06:12

How to count cumulative non-zero variables in a very large data frame without for loop in R?

3 Answers3