distinct() not working for dataframe in R

Question

I'm trying to get specific values from a table regarding COVID 19 cases and I want to find the highest number of cases in each month (there are 4 months).

I created a new column for months and then reorganized the dataframe so that that for each month, all rows are in descending order by cases, however, when I try to use distinct for the months to get only one row per month, it does nothing. Am I using the function incorrectly? My Code

Would it make more sense to `slice_max(n = 1L, order_by = Cases, by = Month)`? — Seth, Aug 19 '23 at 03:05
Hey @Ano and welcome! Please don’t post code or data as images. For the best chance at getting help, try reading [how to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and editing your question accordingly. Good luck! — jpsmith, Aug 19 '23 at 03:06

Mark · Answer 1 · 2023-08-19T04:33:25.337

It's unclear to me if you want the highest number of cases for each month, or the highest number of cases for each country for each month. I went with the second, but included instructions if you would prefer to do the first.

The short version of my answer is- you were on the right path! I'm not sure what happened- maybe you had the right thing, but it was hard to tell if the dataset was being filtered, without looking at the reduced number of rows which R displays.

# since you didn't provide a sample dataset, we download Our World in Data's COVID-19 dataset and transform it into something similar
country_date_cases <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/cases_deaths/new_cases.csv") |>
  pivot_longer(-date, names_to = "Country.Region", values_to = "Cases") |> 
  filter(year(date) == 2020 & between(month(date), 1, 4))

# example dataset
# A tibble: 29,274 × 3
   date       Country.Region      Cases
   <date>     <chr>               <dbl>
 1 2020-01-03 World                   0
 2 2020-01-03 Afghanistan             0
 3 2020-01-03 Africa                  0
 4 2020-01-03 Albania                 0
 5 2020-01-03 Algeria                 0
 6 2020-01-03 American Samoa          0
 7 2020-01-03 Andorra                 0
 8 2020-01-03 Angola                  0
 9 2020-01-03 Anguilla                0
10 2020-01-03 Antigua and Barbuda     0
# ℹ 29,264 more rows

country_date_cases |> 
  mutate(Month = month(date)) |> 
  arrange(Month, desc(Cases)) |> 
  distinct(Country.Region, Month, .keep_all = TRUE) # just using Month gets the largest value for each month, which in my dataset, which includes the world total, is obviously the world total. If we include Country.Region in there too, we get the largest value for each country in each month.

# A tibble: 984 × 4
   date       Country.Region      Cases Month
   <date>     <chr>               <dbl> <dbl>
 1 2020-01-31 World                2015     1
 2 2020-01-31 Asia                 2010     1
 3 2020-01-31 Upper middle income  1990     1
 4 2020-01-31 China                1984     1
 5 2020-01-31 High income            20     1
 6 2020-01-27 Thailand                7     1
 7 2020-01-25 North America           5     1
 8 2020-01-25 United States           5     1
 9 2020-01-26 Europe                  5     1
10 2020-01-26 European Union          5     1
# ℹ 974 more rows
# ℹ Use `print(n = ...)` to see more rows

# so this is equivalent to slice_max. However, with that said, as Seth has already commented, slice_max is the preferred option, as it is shorter, and more readable

country_date_cases |> 
  mutate(Month = month(date)) |>
  slice_max(Cases, n = 1, by = c(Country.Region, Month), with_ties = FALSE) # with slice_max, we also have the option of including ties, which I didn't do here. If you wanted the Country.Region with the largest number of cases in each month, you could change this to `by = Month`

Rui Barradas · Answer 2 · 2023-08-19T06:37:36.400

Here is a way. Data from Mark's answer.

In the code below I create a column Month like in the question but keeping the year, just in case this code is to be made more general purpose. Then, after grouping the data by country and month, summarise will keep the rows equal to the maxima of Case.
No need for time consuming sorting.

suppressPackageStartupMessages({
  library(tidyverse)
})

country_date_cases %>%
  mutate(Month = format(date, "%Y-%m")) %>%
  group_by(Country.Region, Month) %>%
  summarise(Cases = max(Cases, na.rm = TRUE), .groups = "drop")
#> # A tibble: 984 × 3
#>    Country.Region Month   Cases
#>    <chr>          <chr>   <dbl>
#>  1 Afghanistan    2020-01     0
#>  2 Afghanistan    2020-02     1
#>  3 Afghanistan    2020-03    32
#>  4 Afghanistan    2020-04   172
#>  5 Africa         2020-01     0
#>  6 Africa         2020-02     1
#>  7 Africa         2020-03   666
#>  8 Africa         2020-04  1971
#>  9 Albania        2020-01     0
#> 10 Albania        2020-02     0
#> # ℹ 974 more rows

^{Created on 2023-08-19 with reprex v2.0.2}

distinct() not working for dataframe in R

2 Answers2