Choosing first instance of a value by year in a time series

Question

I have a dataset that looks like this (but with more years of data):

dat <- data.frame(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                   "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                  value = c(0.8,1,0.2,0,0.7,1,0.2,0))

I would like to choose the first instance that "value" is >= 0.8 for each year.

So for the above dataset, I would expect the output to be a data frame with two rows and two columns:

new_dat <- data.frame(date = as.Date(c("2000-01-01", "2001-03-31")),
                      value = c(0.8,0.7))
print(new_dat)

I have been trying to accomplish this using dplyr:

dat_grouped <- dat %>%
  mutate(year = year(date))%>%
  group_by(year) %>%
  distinct(value >= 0.8, date = date) #wanted to keep the date column

It gives me TRUE FALSE values for the "value" column, but I can't seem to find a good way to select the first TRUE value. I've tried wrapping distinct() with first() and I've tried piping to which.min(), but neither worked.

I found this entry, but I was hoping for a tidy solution. I'm also having an issue adapting that code to my dataset. I get " Error in apply(x, 2, my.first) : dim(X) must have a positive length "

I would also like to perform the same request but for the first occasion that value <= 0.2. But I assume it would be the same process with a different logical request. Perhaps the logical operator is not the way to go?

Any suggestions are greatly appreciated. Thank you.

Yes, the second row for new_dat should be 1, as the prior value is 0.7. — Joshua Culpepper, Jul 07 '21 at 14:50

score 3 · Accepted Answer · answered Jul 07 '21 at 02:29

3

You can use slice -

library(dplyr)
library(lubridate)

dat %>%
  group_by(year = year(date)) %>%
  slice(match(TRUE, value >= 0.8)) %>%
  ungroup

#   date       value  year
#  <date>     <dbl> <int>
#1 2000-01-01   0.8  2000
#2 2001-03-31   1    2001

If every year is guaranteed to have at least one value which is greater than 0.8 then you can also use which.max -

dat %>%
  group_by(year = year(date)) %>%
  slice(which.max(value >= 0.8)) %>%
  ungroup

answered Jul 07 '21 at 02:29

Ronak Shah

377,200
20
156
213

This does exactly what I wanted. Thank you. Is it possible to use apply(), or something similar, to use the slice function across columns? For example, if I had several "value" columns, such as value_2, value_3, value_4, that all have different values. How does the logical within slice(match()) work to move across columns? – Joshua Culpepper Jul 07 '21 at 15:21
@JoshuaCulpepper You may use `if_any`, something like this - `dat %>% filter(if_any(starts_with('value'), ~. >= 0.8)) %>% group_by(year = year(date)) %>% slice(1L) %>% ungroup` – Ronak Shah Jul 08 '21 at 00:27

IceCreamToucan · Answer 2 · 2021-07-07T22:07:24.967

You can use dplyr::filter to get only values >= 0.8, then group by the year (which you can get with lubridate::year), and dplyr::slice_min to get the first date.

dat <- data.frame(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                   "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                  value = c(0.8,1,0.2,0,0.7,1,0.2,0))

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dat %>% 
  filter(value >= .8) %>% 
  group_by(year = year(date)) %>% 
  slice_min(date)
#> # A tibble: 2 x 3
#> # Groups:   year [2]
#>   date       value  year
#>   <date>     <dbl> <dbl>
#> 1 2000-01-01   0.8  2000
#> 2 2001-03-31   1    2001

^{Created on 2021-07-06 by the reprex package (v2.0.0)}

If your data is already sorted by date, you could skip the filter and use the method below (or one of Ronak's methods)

dat %>% 
  group_by(year = year(date)) %>% 
  slice_max(value >= 0.8, with_ties = FALSE)
#> # A tibble: 2 x 3
#> # Groups:   year [2]
#>   date       value  year
#>   <date>     <dbl> <dbl>
#> 1 2000-01-01   0.8  2000
#> 2 2001-03-31   1    2001

^{Created on 2021-07-07 by the reprex package (v2.0.0)}

hello_friend · Answer 3 · 2021-07-07T00:01:57.300

3 Base R solutions:

# Repeatedly subsetting: data.frame => stdout(console)
subset(
  subset(
    with(
      dat,
      dat[order(date),]
    ),
    value >= 0.8
  ), 
  ave(
    substr(
      date, 
      1, 
      4
    ), 
    substr(
      date, 
      1, 
      4
    ), 
    FUN = seq.int
  ) == 1
)

# All in one base R using `Split-Apply-Combine`: 
# data.frame => stdout(console)
data.frame(
  do.call(
    rbind, 
    lapply(
      with(
        dat, 
        split(
          dat, 
          substr(date, 1, 4)
        )
      ),
      function(x){
        head(
          subset(
            with(x, x[order(date),]),
            value >= 0.8
          ), 
          1
        )  
      }
    )
  ),
  row.names = NULL
)

# In stages Base R: 
# Subset out values not meeting the threshold value
# criteria: above_threshold_df => data.frame
above_threshold_df <- subset(
  with(dat, dat[order(date),]), 
  value >= 0.8
)

# Extract the year from the date variable: 
# grp => integer vector 
grp <- with(above_threshold_df, substr(date, 1, 4))

# Use the group vector to extract the first entry from 
# each year that meets the threshold: 
# res => data.frame
res <- subset(
  above_threshold_df,
  ave(
    grp, 
    grp, 
    FUN = seq.int
  ) == 1
)

Choosing first instance of a value by year in a time series

3 Answers3