0

Dear data manipulation wizards,

I am trying to summarize my dataframe in a way that makes reference to row numbers, but I can't find a way to do it in a tidy way. My dataframe looks like this.

# Sample data frame.
df <- data.frame(value = c(1,2,1,1,2,4,5,3,2))

  value
1     1
2     2
3     1
4     1
5     2
6     4
7     5
8     3
9     2

I need to create a column, which says TRUE if the corresponding number in value as well as the numbers in next 4 consecutive rows are all larger than or equal to 2. The resulting dataframe should look like this:

  value largerThan
1     1      FALSE
2     2      FALSE
3     1      FALSE
4     1      FALSE
5     2       TRUE
6     4         NA
7     5         NA
8     3         NA
9     2         NA

Note the four NA in the last four rows of largerThan. This is because these these rows don't have 4 consecutive rows after them, so they can't be evaluated. This is what is tripping me up, together with the fact that I don't know how to make reference to row numbers when using tidyverse syntax. This was more straightforward with for loops, but I can't think of equivalents.

Suggestions are much appreciated! Especially if you have tidyverse or dplyr solutions, since these are the packages used in the rest of the code.

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
Sol
  • 724
  • 1
  • 5
  • 18

3 Answers3

1

sorry, not familiar enough with the tidyverse.. here is a possible data.table solution. Perhaps you can transelate it to tidyverse

library( data.table )
setDT(df)
df[, largerThan := frollsum( value >= 2, n = 5, align = "left" ) > 4 ]
#    value largerThan
# 1:     1      FALSE
# 2:     2      FALSE
# 3:     1      FALSE
# 4:     1      FALSE
# 5:     2       TRUE
# 6:     4         NA
# 7:     5         NA
# 8:     3         NA
# 9:     2         NA
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • Thanks! I'll try to work out if there is a tidyverse equivalent. It woudl be great if it has something as elegant as this :) – Sol Feb 26 '21 at 10:49
1

Do this

df$largerthan <- zoo::rollsum(df$value >= 2, 5, na.pad = T, align = "left") >=5

  value largerthan
1     1      FALSE
2     2      FALSE
3     1      FALSE
4     1      FALSE
5     2       TRUE
6     4         NA
7     5         NA
8     3         NA
9     2         NA

Following the logic, a baseR way

df$largerthan <- rowSums(cbind(df >=2, c(df[-1,], NA) >=2, c(df[-c(1:2), ], NA, NA) >=2, c(df[-c(1:3),], NA, NA, NA) >=2, c(df[-c(1:4),], NA, NA, NA, NA)>=2)) >=5

If you want to do it in dplyr

df %>% mutate(largerThan = (value >=2 & lead(value) >=2 & lead(value,2) >= 2 & lead(value, 3) >= 2 & lead(value,4) >= 2))
AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • 1
    Thanks! I didn't know about "rollsum". I'll wait to see if someone has an alternative that doesn't require installing a new package, but this already helps a lot. Much appreciated! – Sol Feb 26 '21 at 10:48
  • 1
    For doing it in base R, see this [answer](https://stackoverflow.com/a/19200945/2884859) – AnilGoyal Feb 26 '21 at 10:53
1

Here is a dpylr version, however it is less elegant than the other solutions using rollsums. dpylr has the functions lead() and lag(). The solution would look the following way:

df <- data.frame(value = c(1,2,1,1,2,4,5,3,2))
df = df %>% mutate(
  largerThan = value >= 2 & lead(value) >= 2 & lead(value, 2) >= 2 &
    lead(value, 3) >= 2 & lead(value, 4) >= 2
)

As more leads you need, as easier it gets with alternative solutions.