I have a dataset like the one below (actual dataset has 5M+ rows with no gaps), where I am trying to filter out rows where the sum of all numeric columns for the row itself and its previous and next rows is equal to zero.
N.B.
Time
is adttm
column in the actual data.- Number of consecutive zeros can be more than 3 rows and in that case multiple rows will be filtered out.
# A tibble: 13 x 4
group Time Val1 Val2
<chr> <int> <dbl> <dbl>
1 A 1 0 0
2 B 1 0.1 0
3 A 3 0 0
4 B 3 0 0
5 A 2 0 0
6 B 2 0.2 0.2
7 B 4 0 0
8 A 4 0 0.1
9 A 5 0 0
10 A 6 0 0
11 B 6 0.1 0.5
12 B 5 0.1 0.2
13 A 7 0 0
See the example below for what is desired:
# A tibble: 13 x 8
group Time Val1 Val2 rowsum leadsum lagsum sum
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 0 0 NA NA
2 A 2 0 0 0 0 0 0 This will get filtered out!
3 A 3 0 0 0 0.1 0 0.1
4 A 4 0 0.1 0.1 0 0 0.1
5 A 5 0 0 0 0 0.1 0.1
6 A 6 0 0 0 0 0 0 This will get filtered out!
7 A 7 0 0 0 NA 0 NA
8 B 1 0.1 0 0.1 0.4 NA NA
9 B 2 0.2 0.2 0.4 0 0.1 0.5
10 B 3 0 0 0 0 0.4 0.4
11 B 4 0 0 0 0.3 0 0.3
12 B 5 0.1 0.2 0.3 0.6 0 0.9
13 B 6 0.1 0.5 0.6 NA 0.3 NA
So far I have tried to do this simply by using dplyr::lag()
and dplyr::lead()
; but this is extremely inefficient and throws a memory allocation error for the actual dataset:
> Error in Sys.getenv("TESTTHAT") : > could not allocate memory (0 Mb) in C function 'R_AllocStringBuffer'
This is what I have so far; I can first get the sum of Val1
and Val2
and then perform lead
and lag
but that won't resolve the issue.
df0 %>%
##arrange by group is not necessary since we're grouping by that var
arrange(group, Time) %>%
group_by(group) %>%
mutate(sum = Val1 + Val2 + lag(Val1) + lag(Val2) + lead(Val1) + lead(Val2)) # %>%
# filter(is.na(sum) | sum != 0)
## commenting out filter to show the full results
# > # A tibble: 13 x 5
# > # Groups: group [2]
# > group Time Val1 Val2 sum
# > <chr> <int> <dbl> <dbl> <dbl>
# > 1 A 1 0 0 NA
# ! - A 2 0 0 0
# > 2 A 3 0 0 0.1
# > 3 A 4 0 0.1 0.1
# > 4 A 5 0 0 0.1
# ! - A 6 0 0 0
# > 5 A 7 0 0 NA
# > 6 B 1 0.1 0 NA
# > 7 B 2 0.2 0.2 0.5
# > 8 B 3 0 0 0.4
# > 9 B 4 0 0 0.3
# > 10 B 5 0.1 0.2 0.9
# > 11 B 6 0.1 0.5 NA
Toy dataset:
df0 <- structure(list(group = c("A", "B", "A", "B", "A", "B",
"B", "A", "A", "A", "B", "B", "A"),
Time = c(1L, 1L, 3L, 3L, 2L, 2L, 4L, 4L, 5L, 6L, 6L, 5L, 7L),
Val1 = c(0, 0.1, 0, 0, 0, 0.2, 0, 0, 0, 0, 0.1, 0.1, 0),
Val2 = c(0, 0, 0, 0, 0, 0.2, 0, 0.1, 0, 0, 0.5, 0.2, 0)),
row.names = c(NA, -13L),
class = c("tbl_df", "tbl", "data.frame"))