
I am trying to periodically check the date of the latest downloadable file added to the page https://github.com/mrc-ide/global-lmic-reports/tree/master/data, where the file names look like 2021-05-22_v8.csv.zip

There is a code snippet in "Using R to scrape the link address of a downloadable file from a web page?" that, with a small tweak, identifies the first (earliest) downloadable file on a web page. It is shown below.

library(rvest)
library(stringr)
library(xml2)

page <- read_html("https://github.com/mrc-ide/global-lmic-reports/tree/master/data")

page %>%
  html_nodes("a") %>%       # find all links
  html_attr("href") %>%     # get the url
  str_subset("\\.csv\\.zip$") %>% # keep those that end in .csv.zip
  .[[1]]                    # look at the first one

Returns: [1] "/mrc-ide/global-lmic-reports/blob/master/data/2020-04-28_v1.csv.zip"

The question is: what code would identify the date of the latest .csv.zip file? For example, 2021-05-22_v8.csv.zip, as checked on 2021-06-01.

The purpose is this: if that date (i.e., 2021-05-22) is later than the latest update I have created in https://github.com/pourmalek/covir2 (e.g. IMPE 20210522 in https://github.com/pourmalek/covir2/tree/main/20210528), then a new update needs to be created.

1 Answer


You can convert the date prefix of each file name to a Date and use which.max to get the link with the latest one.

library(rvest)
library(stringr)
library(xml2)

page <- read_html("https://github.com/mrc-ide/global-lmic-reports/tree/master/data")

page %>%
  html_nodes("a") %>%       # find all links
  html_attr("href") %>%     # get the url
  str_subset("\\.csv\\.zip$") -> tmp # keep those that end in .csv.zip

tmp[tmp %>%
  basename() %>%
  substr(1, 10) %>%
  as.Date() %>% which.max()]

#[1] "/mrc-ide/global-lmic-reports/blob/master/data/2021-05-22_v8.csv.zip"
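
If you want to try the selection step without hitting GitHub, here is a minimal offline sketch in base R. It assumes every relevant file name starts with a YYYY-MM-DD date. The hrefs vector is a hypothetical stand-in for tmp, and only its first and last entries come from the question; the middle one and the README link are made up for illustration.

```r
# Hypothetical stand-in for `tmp`; the middle entry and the README are made up.
hrefs <- c(
  "/mrc-ide/global-lmic-reports/blob/master/data/2020-04-28_v1.csv.zip",
  "/mrc-ide/global-lmic-reports/blob/master/data/2021-01-04_v7.csv.zip",
  "/mrc-ide/global-lmic-reports/blob/master/data/README.md",
  "/mrc-ide/global-lmic-reports/blob/master/data/2021-05-22_v8.csv.zip"
)

file_names <- basename(hrefs)

# Keep only names that begin with a YYYY-MM-DD prefix, so that as.Date()
# never sees a string it cannot parse.
has_date <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}", file_names)
file_dates <- as.Date(substr(file_names[has_date], 1, 10))

# Link whose file name carries the most recent date.
latest_href <- hrefs[has_date][which.max(file_dates)]
latest_href
```

Filtering on the date prefix first is a defensive choice: as.Date() throws an error on a string that is not in a standard date format, so a stray link would otherwise break the pipeline.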

To get the latest date itself, you can use:

tmp %>%
  basename() %>%
  substr(1, 10) %>%
  as.Date() %>% max()

#[1] "2021-05-22"
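
For the comparison described in the question, one possible sketch is below. It assumes your own latest update is available as a label like "IMPE 20210522" (the format mentioned in the question). The function name needs_new_update is hypothetical, and how you actually obtain that label from the covir2 repository is left out.

```r
# Hypothetical helper: TRUE when the remote file is newer than the local update.
# `last_update_label` is assumed to contain an 8-digit YYYYMMDD date.
needs_new_update <- function(latest_remote, last_update_label) {
  digits <- sub(".*([0-9]{8}).*", "\\1", last_update_label)
  last_update <- as.Date(digits, format = "%Y%m%d")
  latest_remote > last_update
}

latest_remote <- as.Date("2021-05-22")  # the date found on the data page
needs_new_update(latest_remote, "IMPE 20210522")
# [1] FALSE
```

Here the result is FALSE because both dates are 2021-05-22, so no new update would be needed. A later file date on the data page would give TRUE.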
Ronak Shah