
I have a dataset like the one below (the actual dataset has 5M+ rows with no gaps), and I am trying to filter out rows for which the sum of all numeric columns, across the row itself and its previous and next rows, equals zero.

N.B.

  • Time is a dttm column in the actual data.
  • The number of consecutive zero rows can exceed 3; in that case multiple rows will be filtered out.
# A tibble: 13 x 4
   group  Time  Val1  Val2
   <chr> <int> <dbl> <dbl>
 1 A         1   0     0  
 2 B         1   0.1   0  
 3 A         3   0     0  
 4 B         3   0     0  
 5 A         2   0     0  
 6 B         2   0.2   0.2
 7 B         4   0     0  
 8 A         4   0     0.1
 9 A         5   0     0  
10 A         6   0     0  
11 B         6   0.1   0.5
12 B         5   0.1   0.2
13 A         7   0     0  

See the example below for what is desired:

# A tibble: 13 x 8
   group  Time  Val1  Val2 rowsum leadsum lagsum   sum
   <chr> <int> <dbl> <dbl>  <dbl>   <dbl>  <dbl> <dbl>
 1 A         1   0     0      0       0     NA    NA  
 2 A         2   0     0      0       0      0     0     This will get filtered out! 
 3 A         3   0     0      0       0.1    0     0.1
 4 A         4   0     0.1    0.1     0      0     0.1
 5 A         5   0     0      0       0      0.1   0.1
 6 A         6   0     0      0       0      0     0     This will get filtered out!
 7 A         7   0     0      0      NA      0    NA  
 8 B         1   0.1   0      0.1     0.4   NA    NA  
 9 B         2   0.2   0.2    0.4     0      0.1   0.5
10 B         3   0     0      0       0      0.4   0.4
11 B         4   0     0      0       0.3    0     0.3
12 B         5   0.1   0.2    0.3     0.6    0     0.9
13 B         6   0.1   0.5    0.6    NA      0.3  NA  

So far I have tried to do this simply with dplyr::lag() and dplyr::lead(), but this is extremely inefficient and throws a memory allocation error on the actual dataset:

>     Error in Sys.getenv("TESTTHAT") : 
>       could not allocate memory (0 Mb) in C function 'R_AllocStringBuffer'

This is what I have so far; I could first get the sum of Val1 and Val2 and then perform lead and lag, but that won't resolve the issue.

df0 %>% 
  ##arrange by group is not necessary since we're grouping by that var
  arrange(group, Time) %>% 
  group_by(group) %>% 
  mutate(sum = Val1 + Val2 + lag(Val1) + lag(Val2) + lead(Val1) + lead(Val2)) # %>% 
  # filter(is.na(sum) | sum != 0)
  ## commenting out filter to show the full results
# >  # A tibble: 13 x 5
# >  # Groups:   group [2]
# >  group  Time  Val1  Val2   sum
# >  <chr> <int> <dbl> <dbl> <dbl>
# >  1  A   1     0     0      NA  
# !  -  A   2     0     0      0  
# >  2  A   3     0     0      0.1
# >  3  A   4     0     0.1    0.1
# >  4  A   5     0     0      0.1
# !  -  A   6     0     0      0  
# >  5  A   7     0     0      NA  
# >  6  B   1     0.1   0      NA  
# >  7  B   2     0.2   0.2    0.5
# >  8  B   3     0     0      0.4
# >  9  B   4     0     0      0.3
# >  10 B   5     0.1   0.2    0.9
# >  11 B   6     0.1   0.5    NA  
Toy dataset:
df0 <- structure(list(group = c("A", "B", "A", "B", "A", "B", 
                                "B", "A", "A", "A", "B", "B", "A"),
                      Time = c(1L, 1L, 3L, 3L, 2L, 2L, 4L, 4L, 5L, 6L, 6L, 5L, 7L), 
                      Val1 = c(0, 0.1, 0, 0, 0, 0.2, 0, 0, 0, 0, 0.1, 0.1, 0), 
                      Val2 = c(0, 0, 0, 0, 0, 0.2, 0, 0.1, 0, 0, 0.5, 0.2, 0)), 
                 row.names = c(NA, -13L), 
                 class = c("tbl_df", "tbl", "data.frame"))
M--
  • in your toy data `df0` - only row 4 should be filtered out, is that correct? – Donald Seinen Nov 01 '21 at 04:30
  • @DonaldSeinen I have shown the desired output for the toy dataset; row 5 (A-2) and row 10 (A-6) will be filtered. – M-- Nov 01 '21 at 04:46
  • Hi @M-- Thanks for the nice minimal example! Because your question is about efficiency on larger data, you may consider adding a toy data set of sufficient size/complexity. Cheers – Henrik Nov 01 '21 at 09:13

5 Answers


We can use base rle(), or its faster implementation rlenc() from the purler package.
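For intuition, base rle() compresses a vector into run lengths and values; rlenc() (as used in the code below) reports the same information as a table that also carries each run's start position. A tiny base rle() sketch with hand-made row sums (illustrative only):

rle(c(0, 0, 0, 0.1, 0, 0))
# lengths: 3 1 2
# values : 0 0.1 0
# the first run -- three consecutive zero sums -- is one whose interior rows get dropped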

library(tidyverse)
library(purler)
subsetter <- function(df){
  df %>%
    select(where(is.double)) %>%                  # keep only the double-valued columns
    rowSums() %>%                                 # per-row sum across those columns
    purler::rlenc() %>%                           # run-length encode: start, lengths, values
    filter(lengths >= 3L & values == 0L) %>%      # runs of zero sums spanning 3+ rows
    transmute(ids = map2(start, start + lengths,  # interior row indices of each such run,
                         ~ (.x + 1):(.y - 2))) %>% # i.e. everything except its first and last row
    unlist(use.names = FALSE)                     # vector of row indices to drop
}
# to get data as shown in example
df0 <- df0 %>%
  mutate(Time = as.character(Time)) %>%
  arrange(group, Time)

edge_cases <- tribble(
  ~group, ~Time, ~Val1, ~Val2,
  "C", "1", 0, 0,
  "C", "2", 0, 0,
  "C", "3", 0, 0,
  "C", "4", 0, 0,
)

df1 <- rbind(df0, edge_cases)
df1 %>%
  `[`(-subsetter(.),)

# A tibble: 13 x 4
   group Time   Val1  Val2
   <chr> <chr> <dbl> <dbl>
 1 A     1       0     0  
 2 A     3       0     0  
 3 A     4       0     0.1
 4 A     5       0     0  
 5 A     7       0     0  
 6 B     1       0.1   0  
 7 B     2       0.2   0.2
 8 B     3       0     0  
 9 B     4       0     0  
10 B     5       0.1   0.2
11 B     6       0.1   0.5
12 C     1       0     0  
13 C     4       0     0  
bench::mark(df1 %>% `[`(-subsetter(.),))[,c(3,5,7)]
# A tibble: 1 x 3
    median mem_alloc n_itr
  <bch:tm> <bch:byt> <int>
1   3.91ms    9.38KB    93
Donald Seinen
  • Thank you. It can be more than 3 rows, as long as the lead, the lag, and the row itself sum to zero. – M-- Nov 01 '21 at 05:31
  • @M-- I've adjusted the function. I'm sure however that this can be further optimized for memory usage and speed by cutting into the functions used and removing type checks etc. This might involve some restructuring of the data, i.e. making it a matrix, is that acceptable? – Donald Seinen Nov 01 '21 at 07:22

Since you tagged data.table, here's a data.table-native solution:

library(data.table)
dt0 <- as.data.table(df0)
setorder(dt0, Time) # add 'group' if you want

isnum <- names(which(sapply(dt0, function(z) is.numeric(z) & !is.integer(z))))
isnum
# [1] "Val1" "Val2"

dt0[, sum0 := abs(rowSums(.SD)) < 1e-9, .SDcols = isnum
  ][, .SD[(c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3,], by = .(group)
  ][, sum0 := NULL ][]
#      group  Time  Val1  Val2
#     <char> <int> <num> <num>
#  1:      A     1   0.0   0.0
#  2:      A     3   0.0   0.0
#  3:      A     4   0.0   0.1
#  4:      A     5   0.0   0.0
#  5:      A     7   0.0   0.0
#  6:      B     1   0.1   0.0
#  7:      B     2   0.2   0.2
#  8:      B     3   0.0   0.0
#  9:      B     4   0.0   0.0
# 10:      B     5   0.1   0.2
# 11:      B     6   0.1   0.5

Per your comment, both A-2 and A-6 have been removed.

Efficiencies:

  • rowSums is fast and efficient;
  • We shift using direct indexing with a default of 0; in data.table, this is handled very efficiently and does not incur the (admittedly small) overhead of lead/lag/shift calls (see the short sketch after this list);
  • After we sum a row, we row-shift that single value instead of performing four row-shifts per row.
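A minimal base-R sketch of that index-shift idiom, with x standing in for one group's sum0 flags (illustrative only, not part of the answer's code; inside the data.table call, .N plays the role of length(x)):

x <- c(TRUE, TRUE, TRUE, FALSE)   # hypothetical per-row "sum is zero" flags for one group
lag1  <- c(0, x[-length(x)])      # previous value, first element filled with 0
lead1 <- c(x[-1], 0)              # next value, last element filled with 0
lag1 + x + lead1                  # 2 3 2 1 -- equals 3 only where the row, its previous and its next are all zero-sum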

Edit, for a slight performance improvement (15-20%):

dt0[
  dt0[, sum0 := abs(rowSums(.SD)) < 1e-9, .SDcols = isnum
    ][, .I[(c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3], by=group ]$V1
][, sum0 := NULL][]

Admittedly that can be a little harder to follow, but it produces the same results in around 82% of the time (with this dataset). Thanks to @Henrik for helping me understand the .I and its benefits a little more.

r2evans
  • Hi @r2evans! Not that it matters in the context of the question, but note that `.SDcols` also takes a function, e.g. `.SDcols=is.numeric`. Cheers – Henrik Nov 01 '21 at 13:16
  • Thanks @Henrik! Yes, while I haven't worked that into my habits yet, in this case we'd need something more than that since `Time` will be an issue: `is.numeric` *here* will be true (should not be), and `is.double` works *here* but `is.double(Sys.time())` is true so not good overall. (I do like to keep code compact, but in this case I don't think having `isnum` broken out is a bad thing for a walk-through, especially if not proficient in `data.table`.) Thanks again! – r2evans Nov 01 '21 at 13:22
  • It may also be worth considering indexing `.I` instead of `.SD`, which has a large overhead (costly if there are many groups). – Henrik Nov 01 '21 at 13:30
  • *That's* where my `data.table`-skills are a little weak. I've seen examples with `.I` that appear to be double-nesting the variable, which I haven't fully grokked yet. (The layers of `data.table`-efficiencies are non-trivial :-). Are you able to translate the `[, .SD[...], by=group]` expression into a more efficient `.I`-variant? – r2evans Nov 01 '21 at 13:34
  • r2evans and @Henrik thank you for your answer and insight. Henrik, I've tried this and although it's faster and did overcome the memory issue, I wouldn't mind if it could be done a little faster :) – M-- Nov 01 '21 at 13:49
  • @r2evans this worked when I was testing this on the smaller subset of my data, but unfortunately returned an error for the whole dataset : ```setorder(dt0, P_DATETIME) # add 'group' if you want Error in forderv(x, cols, sort = TRUE, retGrp = FALSE, order = order, : Unable to allocate 5568576 bytes of working memory``` – M-- Nov 01 '21 at 13:58
  • Wow, if `forderv` can't work, then it sounds like your data might be just a little too big for in-memory work in R. You may need to consider a database solution, whether local/file-based like SQLite or DuckDB, or perhaps something a bit more formal (postgres, sql server, mariadb, etc). Or you can split your data up by group into individual files, load one at a time, subset it, then save back to disk. Not ideal, but I can't think of anything that won't be significantly more efficient than `setorder`/`forderv`. – r2evans Nov 01 '21 at 14:17
  • @r2evans A very simple example to (hopefully) give you a better feeling for subsetting within groups using indexing of `.I` (row number in _original_ (full) data set); e.g. select rows corresponding to max value by group. `d = data.table(g = c(1, 1, 1, 2, 2, 2), val = c(2, 5, 3, 3, 1, 3))`; step by step: `d[ , val == max(val), by = g]`; `d[ , .I[val == max(val)], by = g]`; `d[ , .I[val == max(val)], by = g]$V1`; `d[d[ , .I[val == max(val)], by = g]$V1]`; `d[ , .SD[val == max(val)], by = g]` – Henrik Nov 01 '21 at 14:26
  • Thanks @Henrik. I've seen that methodology before, and haven't internalized it yet. My mind sees `d[d[...]$V1]` and thinks it's a join-like operation, so it must be less-efficient; your walk-through does a better job than others I've seen (e.g., https://stackoverflow.com/a/23586059/3358272). I'm sold, thank you for the lesson :-) – r2evans Nov 01 '21 at 18:47
  • @r2evans You are welcome. In [a benchmark on larger data](https://stackoverflow.com/a/41838383), `.I[` was about 10 times faster than `.SD[`, so the speed-up may be significant. Good luck! – Henrik Nov 01 '21 at 20:35

You can try the following data.table option

setorder(setDT(df0), group, Time)[        # sort by group and time so runs are consecutive
  ,
  rs := rowSums(Filter(is.double, .SD))   # per-row sum of the double columns
][, .SD[!(rs == 0 & .N > 2 & (!rowid(rs) %in% c(1, .N)))], rleid(rs)][  # group by runs of equal rs; in a zero run of 3+ rows drop all but its first and last row
  ,
  rleid := NULL                           # drop the helper grouping column
][]

which gives

    group Time Val1 Val2
 1:     A    1  0.0  0.0
 2:     A    3  0.0  0.0
 3:     A    4  0.0  0.1
 4:     A    5  0.0  0.0
 5:     A    7  0.0  0.0
 6:     B    1  0.1  0.0
 7:     B    2  0.2  0.2
 8:     B    3  0.0  0.0
 9:     B    4  0.0  0.0
10:     B    5  0.1  0.2
11:     B    6  0.1  0.5
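For reference, the grouping above relies on data.table::rleid(), which assigns one id per run of consecutive identical values, so every stretch of equal row sums becomes its own group (standalone illustration, not part of the chain):

library(data.table)
rleid(c(0, 0, 0, 0.1, 0, 0))
# [1] 1 1 1 2 3 3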
ThomasIsCoding

This solution is mainly inspired by @r2evans's. It uses Reduce, `+`, and shift, as opposed to @r2evans's solution, which is based on the rowSums and c functions. I think the improvement comes from using Reduce(`+`, .SD) instead of rowSums(.SD) on a data.frame/data.table (and also from avoiding [, .SD[...], ...] in data.table syntax); it is faster (at least on my PC) and more memory efficient (no conversion to a matrix). Caveat: there is no direct equivalent of rowSums(.SD, na.rm=TRUE).
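A minimal illustration of the Reduce(`+`, .SD) idea with toy values (not part of the benchmark below): a data.frame/data.table is a list of columns, so Reduce() adds the columns element-wise and never materialises a matrix, whereas rowSums() coerces its input to a matrix first.

dd <- data.frame(Val1 = c(0, 0.1, 0), Val2 = c(0, 0, 0.2))
Reduce(`+`, dd)   # 0.0 0.1 0.2 -- element-wise sum of the columns
rowSums(dd)       # same values, via an internal as.matrix() conversion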

n = 1e7
dt0 = setDT(df0[sample(nrow(df0), n, replace=TRUE), ])
setorder(dt0, group, Time)
isnum = sapply(dt0, function(x) is.numeric(x) && !is.integer(x))
eps = sqrt(.Machine$double.eps)

# New solution
f1 = function() {
  ans = dt0[, is0 := {sum0 = abs(Reduce(`+`, .SD)) < eps; Reduce(`+`, shift(sum0, -1:1, fill=0)) < 3}, 
            by=group, .SDcols=isnum][(is0), !"is0"]
  
  dt0[, is0 := NULL] # remove is0 from the initial dataset
  ans
}

# similar to f1: easily adaptable to rowSums(.SD, na.rm=TRUE).
f2 = function() {
  # here I replace Reduce(`+`, .SD) with rowSums(.SD) just in case its na.rm argument is needed.
  ans = dt0[, is0 := {sum0 = abs(rowSums(.SD)) < eps; Reduce(`+`, shift(sum0, -1:1, fill=0)) < 3}, 
            by=group, .SDcols=isnum][(is0), !"is0"]
  
  dt0[, is0:=NULL] # remove is0 from the initial dataset
  ans
}

# r2evans first solution
f3 = function() {
  ans = dt0[
    dt0[, sum0 := abs(rowSums(.SD)) < eps, .SDcols = isnum
    ][, .I[(c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3], by=group ]$V1
  ][, sum0 := NULL][]
  
  dt0[, sum0 := NULL] # remove sum0 from the initial dataset
  ans
}

# r2evans second solution
f4 = function() {
  ans = dt0[, sum0 := abs(rowSums(.SD)) < eps, .SDcols = isnum
  ][, .SD[(c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3,], by = .(group)
  ][, sum0 := NULL ][]
  
  dt0[, sum0:=NULL] # remove sum0 from the initial dataset
  ans
}

# modified version of r2evans second solution: similar to f4 but avoid [, .SD[...], by=group]
f5 = function() {
  ans = dt0[, sum0 := abs(rowSums(.SD)) < eps, .SDcols = isnum
  ][, sum0 := (c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3, by = .(group)
  ][(sum0), !"sum0"][]
  
  dt0[, sum0:=NULL] # remove sum0 from the initial dataset
  ans
}

Benchmark:

bench::mark(
  f1(),
  f2(),
  f3(),
  f4(),
  f5(),
  iterations=5L, check=FALSE
)

# A tibble: 5 x 13
  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f1()        347ms  406ms      2.49  698.47MB     5.48     5    11      2.01s
2 f2()        529ms  578ms      1.69  851.02MB     4.06     5    12      2.96s
3 f3()        717ms  821ms      1.22    1.25GB     3.40     5    14      4.12s
4 f4()        889ms  956ms      1.04    1.57GB     5.01     5    24      4.79s
5 f5()        642ms  677ms      1.40    1.07GB     3.37     5    12      3.56s
 

Based on these results, the first solution (f1) is 2+ times faster than f3 and f4 and is also more memory efficient.

I am using the dev version of data.table (data.table 1.14.3).

library(tidyverse)
df0 %>%
  arrange(group, Time) %>%  # EDIT to arrange by time (and group for clarity)
  rowwise() %>%
  mutate(sum = sum(c_across(Val1:Val2))) %>%
  group_by(group) %>%
  filter( !(sum == 0 & lag(sum, default = 1) == 0 & lead(sum, default = 1) == 0)) %>%
  ungroup()

# A tibble: 11 x 5
   group  Time  Val1  Val2   sum
   <chr> <int> <dbl> <dbl> <dbl>
 1 A         1   0     0     0  
 2 A         3   0     0     0  
 3 A         4   0     0.1   0.1
 4 A         5   0     0     0  
 5 A         7   0     0     0  
 6 B         1   0.1   0     0.1
 7 B         2   0.2   0.2   0.4
 8 B         3   0     0     0  
 9 B         4   0     0     0  
10 B         5   0.1   0.2   0.3
11 B         6   0.1   0.5   0.6
Jon Spring
  • Thank you for your answer. This is the same as what I have mentioned (I can first get the sum and then perform lead and lag). Unfortunately, this won't resolve the memory issue. – M-- Nov 01 '21 at 04:43
  • Moreover, sorting by time is necessary. Otherwise, the wrong rows (timestamps) will get filtered out. – M-- Nov 01 '21 at 04:47