1

I am working on a project dealing with Covid-19 Data. I have data that is updated daily from Our World in Data. The csv file is here: https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv

The data has several columns country, date, cases, etc.

What I am interested in is saving only the most recent row for each country and removing everything else. What would be the best way to go about this?

Currently, my code looks like this. I have recently made the transition to R from another program, so guidance is helpful even if this is a dumb question!

world.data < -read.csv("https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv")
world.data$iso_code < -NULL# Remove Country ISO Code
world.data$date < -as.Date(world.data$date, "%Y-%m-%d")

library(ggplot2)
jmt93
  • 11
  • 1

2 Answers2

1

Here is a solution that uses the tidyverse. We group the data by location and select the maximum value of date.

rawData <- "https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv"
download.file(rawData,"./data/owid_covid_data.csv")

data <- read.csv("./data/owid_covid_data.csv",header = TRUE, stringsAsFactors = FALSE)

library(dplyr)
data %>% group_by(location) %>% 
     filter(date == max(date)) -> filteredData

...and the first few rows of output:

> head(filteredData[1:4])
# A tibble: 6 x 4
# Groups:   location [6]
  iso_code location    date       total_cases
  <chr>    <chr>       <chr>            <int>
1 ABW      Aruba       2020-04-19          96
2 AFG      Afghanistan 2020-04-19         908
3 AGO      Angola      2020-04-19          24
4 AIA      Anguilla    2020-04-19           3
5 ALB      Albania     2020-04-19         548
6 AND      Andorra     2020-04-19         704
> 
Len Greski
  • 10,505
  • 2
  • 22
  • 33
  • Great answer; to clarify, does `filter(date == max(date))` filter by the max date for each row? Or does it pick the max date for the column and fill in any entries that have not been updated by NA? – jared_mamrot Apr 19 '20 at 23:14
  • 1
    @jpmam1 - since `group_by(location)` precedes `filter()` in the pipeline, `filter()` processes for each distinct `location`, returning the row where `date` equals the maximum value of `date` for that location. So, if Aruba's most recent date was 2020-04-19 but Angola's most recent date was 2020-04-18, the row returned for Angola would be the one from the 18th of April. The `dplyr` package is very powerful, but is hard to learn because it contains hundreds of functions. In fact, the `top_n(1,date)` from the other answer is simply another way to specify `filter(date == max(date))`. – Len Greski Apr 19 '20 at 23:24
0

Try something like:

library(tidyverse)
world.data %>% group_by(location) %>% top_n(1,date)

or without the pipe

top_n(group_by(world.data, location), 1, date)
Martin Gal
  • 16,640
  • 5
  • 21
  • 39