How to use R to identify the most up-to-date downloadable file on a web page?

Question

I am trying to check periodically for the date of the latest downloadable file added to the page https://github.com/mrc-ide/global-lmic-reports/tree/master/data, where the file names look like 2021-05-22_v8.csv.zip.

There is a code snippet in "Using R to scrape the link address of a downloadable file from a web page?" that, with a tweak, returns the link to the first (earliest) .csv.zip file on the page, and so shows its date. It is shown below.

library(rvest)
library(stringr)
library(xml2)

page <- read_html("https://github.com/mrc-ide/global-lmic-reports/tree/master/data")

page %>%
  html_nodes("a") %>%       # find all links
  html_attr("href") %>%     # get the url
  str_subset("\\.csv\\.zip$") %>% # keep those that end in .csv.zip
  .[[1]]                    # look at the first one

Returns: [1] "/mrc-ide/global-lmic-reports/blob/master/data/2020-04-28_v1.csv.zip"

The question is: what code would identify the date of the latest .csv.zip file? For example, 2021-05-22_v8.csv.zip when checked on 2021-06-01.

The purpose is this: if that date (i.e., 2021-05-22) is later than the latest update I have created in https://github.com/pourmalek/covir2 (e.g., IMPE 20210522 in https://github.com/pourmalek/covir2/tree/main/20210528), then a new update needs to be created.

Ronak Shah · Accepted Answer · 2021-06-02T02:55:16.170


You can convert the date at the start of each file name to a Date and use which.max to get the latest link.

library(rvest)
library(stringr)
library(xml2)

page <- read_html("https://github.com/mrc-ide/global-lmic-reports/tree/master/data")

tmp <- page %>%
  html_nodes("a") %>%            # find all links
  html_attr("href") %>%          # get the url
  str_subset("\\.csv\\.zip$")    # keep those that end in .csv.zip

tmp[tmp %>%
  basename() %>%
  substr(1, 10) %>%
  as.Date() %>% which.max()]

#[1] "/mrc-ide/global-lmic-reports/blob/master/data/2021-05-22_v8.csv.zip"
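The same selection step can be tried offline on a hard-coded vector. This is only a sketch in base R: the `links` vector stands in for the scraped `tmp`, and its third entry is a hypothetical file name added to show what happens when a name does not start with a date. Passing an explicit `format` to `as.Date()` is an assumption about what is wanted here; it turns unparseable names into NA, which `which.max()` then skips.

```r
# Stand-in for the scraped `tmp` vector; the third entry is hypothetical
links <- c(
  "/mrc-ide/global-lmic-reports/blob/master/data/2020-04-28_v1.csv.zip",
  "/mrc-ide/global-lmic-reports/blob/master/data/2021-05-22_v8.csv.zip",
  "/mrc-ide/global-lmic-reports/blob/master/data/README_notes.csv.zip"
)

# Parse the first 10 characters of each file name; with an explicit format,
# strings that are not valid dates become NA
dates <- as.Date(substr(basename(links), 1, 10), format = "%Y-%m-%d")

# which.max() ignores NA values and returns the index of the latest date
latest_link <- links[which.max(dates)]
latest_link
```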

To get the latest date itself, you can use:

tmp %>%
  basename() %>%
  substr(1, 10) %>%
  as.Date() %>% max()

#[1] "2021-05-22"
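Since the stated goal is to decide whether a new update is needed, the resulting date could then be compared with the date of the last local update. A minimal sketch, assuming the local update is labelled in YYYYMMDD form as in the question's "IMPE 20210522" example; `latest_remote`, `last_local`, and `needs_update` are hypothetical names, and `latest_remote` is hard-coded here instead of being taken from the scraped page.

```r
# Assumed inputs: `latest_remote` would normally be the max() result above;
# `last_local` is a hypothetical YYYYMMDD label of the last local update
latest_remote <- as.Date("2021-05-22")
last_local <- "20210522"

# TRUE only if the remote file is newer than the last local update
needs_update <- latest_remote > as.Date(last_local, format = "%Y%m%d")
needs_update
```

With these two values the dates are equal, so `needs_update` is FALSE and no new update would be triggered.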

edited Jun 02 '21 at 02:55

answered Jun 02 '21 at 02:14

Ronak Shah

Thanks! This code perfectly returns the **URL** of the latest .csv.zip file. Though it is a trivial task, what is wanted is the **date** of the latest .csv.zip file, i.e., the 2021-05-22 string in this case. – Farshad Pourmalek Jun 02 '21 at 02:49
See the updated answer to get the date of the latest .csv.zip file. – Ronak Shah Jun 02 '21 at 02:55
This returns exactly what was wanted, the **date** of the latest file. Superb. Thanks again. Case solved. – Farshad Pourmalek Jun 02 '21 at 03:02
