How can I select rows from dataframe based on date and value?

Question

I have a dataframe with lots of countries and their total cases and new cases on different dates. It looks as follows:

  iso_code continent     location date       total_cases new_cases stringency_index population
  <chr>    <chr>         <chr>    <chr>            <dbl>     <dbl>            <dbl>      <dbl>
1 ABW      North America Aruba    2020-03-13           2         2              0       106766
2 ABW      North America Aruba    2020-03-19          NA        NA             33.3     106766
3 ABW      North America Aruba    2020-03-20           4         2             33.3     106766
4 ABW      North America Aruba    2020-03-21          NA        NA             44.4     106766
5 ABW      North America Aruba    2020-03-22          NA        NA             44.4     106766
6 ABW      North America Aruba    2020-03-23          NA        NA             44.4     106766

I am able to filter the dataframe to get all rows where new_cases >= 5:

df_filtered <- df %>% filter(new_cases >= 5)

However, this gives me all rows where new_cases are equal to or greater than 5:

  iso_code continent     location date       total_cases new_cases stringency_index population
  <chr>    <chr>         <chr>    <chr>            <dbl>     <dbl>            <dbl>      <dbl>
1 ABW      North America Aruba    2020-03-24          12         8             44.4     106766
2 ABW      North America Aruba    2020-03-25          17         5             44.4     106766
3 ABW      North America Aruba    2020-03-27          28         9             44.4     106766
4 ABW      North America Aruba    2020-03-30          50        22             85.2     106766
5 ABW      North America Aruba    2020-04-01          55         5             85.2     106766
6 ABW      North America Aruba    2020-04-03          60         5             85.2     106766

How can I only get the row with the earliest/first date where this condition holds?

This is what my output would ideally look like:

  iso_code continent     location           date       total_cases new_cases stringency_index population
  <chr>    <chr>         <chr>              <chr>            <dbl>     <dbl>            <dbl>      <dbl>
1 ABW      North America Aruba              2020-03-24          12         8             44.4     106766
2 AFG      Asia          Afghanistan        2020-03-16          16         6             38.9     38928341
3 AGO      Africa        Angola             2020-04-19          24         5             90.7     32866268
4 ALB      Europe        Albania            2020-03-13          23        12             78.7     2877800
5 AND      Europe        Andorra            2020-03-17          14         9             31.4     77265
6 ARE      Asia          Utd. Arab Emirates 2020-02-28          19         6              8.3     9890400

You could try `df_filtered <- df %>% filter(new_cases >= 5 & date==min(date))` — Duck, Sep 02 '20 at 17:54
@Duck - missing a `group_by`, and that compound condition will only be true if the country has at least 5 cases on it's first date, not the first date the condition is true. — Gregor Thomas, Sep 02 '20 at 20:56
@Duck I tried that, but it only returns the lowest date out the entire dataset — asd7, Sep 02 '20 at 20:56
@GregorThomas In that case it should be better to use `first()` — Duck, Sep 02 '20 at 20:57
@asd7 As no significative data is provided try this `df %>% group_by(continent) %>% filter(new_cases >= 5) %>% filter(date==min(date))` or `df %>% group_by(continent) %>% filter(new_cases >= 5) %>% filter(date==first(date)) ` — Duck, Sep 02 '20 at 21:03
@Duck using the second code you suggested and substituting "iso_code" for "continent" in group_by worked, thank you! How should I provide data for future reference? — asd7, Sep 02 '20 at 21:16

score 1 · Accepted Answer · answered Sep 02 '20 at 20:58

1

Try this:

df %>% 
  group_by(iso_code) %>%  ## within each country (group)
  filter(new_cases >= 5) %>%  ## keep rows where there are at least 5 cases
  slice_min(date, n = 1, with_ties = FALSE)  ## then keep the row with the smallest date

answered Sep 02 '20 at 20:58

Gregor Thomas

136,190
20
167
294

score 0 · Answer 2 · answered Sep 02 '20 at 21:08

I got it to work with the following code:

df_filtered <- df %>% filter(new_cases >= 5) #filter all new_cases with at least 5

df_sorted <- df_filtered %>%                 #group by country and arrange by date,
  group_by(iso_code) %>%                     #then get the first row of every 
  arrange(date) %>%                          #group 
  slice(1L)

Inspired by the answer to this question Earliest Date for each id in R

How can I select rows from dataframe based on date and value?

2 Answers2