Archive.org provides the Wayback CDX API for looking up captures; it returns timestamps along with the original URLs, either in tabular form or as JSON. Such queries can be made with read.table() alone, and links to specific captures can then be constructed from the timestamp and original columns plus the base URL (https://web.archive.org/web/).
read.table("https://web.archive.org/cdx/search/cdx?url=covid.cdc.gov/covid-data-tracker/&limit=5",
           col.names = c("urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"),
           colClasses = "character")
#> urlkey timestamp
#> 1 gov,cdc,covid)/covid-data-tracker 20200824224244
#> 2 gov,cdc,covid)/covid-data-tracker 20200825013347
#> 3 gov,cdc,covid)/covid-data-tracker 20200825024622
#> 4 gov,cdc,covid)/covid-data-tracker 20200825042657
#> 5 gov,cdc,covid)/covid-data-tracker 20200825050018
#> original mimetype statuscode
#> 1 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 2 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 3 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 4 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 5 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> digest length
#> 1 APS6SXNXBXCJU3P4N23WH4XCVDVZQYAD 5342
#> 2 XFEMFRGXIPWM4K5F6CBIYDSOFIGCUBQZ 5370
#> 3 TVQKZHRM452CFX4RIORWGSMK5PG3PAPR 5343
#> 4 XZDLPJ6EQIXEO4SUFQTFEX4S6SF7O4GT 5370
#> 5 A4J63TFU7HMZQE5KFTSLBD6EFNZ4IBZ4 5373
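The link construction mentioned above can be sketched in base R. This is only a minimal illustration: wayback_link() is a hypothetical helper name (not part of any package), and the small data frame is typed in by hand from the first two rows of the output above, so the snippet runs without a network request.

```r
# Hypothetical helper: join base URL, capture timestamp and original URL with "/"
wayback_link <- function(timestamp, original,
                         base_url = "https://web.archive.org/web") {
  paste(base_url, timestamp, original, sep = "/")
}

# Hand-typed stand-in for the first two rows returned by read.table() above
captures <- data.frame(
  timestamp = c("20200824224244", "20200825013347"),
  original  = "https://covid.cdc.gov/covid-data-tracker/",
  stringsAsFactors = FALSE
)

captures$link <- wayback_link(captures$timestamp, captures$original)
captures$link[1]
#> [1] "https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/"
```

Keeping the timestamp as character (colClasses = "character" in the call above) matters here, since a numeric timestamp could be printed in scientific notation when pasted into the URL.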
To make it a bit more convenient to work with, we can customize the API request with httr / httr2, for example, and pass the response through a readr / dplyr / lubridate pipeline:
library(dplyr)
library(httr2)
library(readr)
archive_links <- request("https://web.archive.org/cdx/search/cdx") %>%
  # set query parameters
  req_url_query(
    url = "covid.cdc.gov/covid-data-tracker/",
    filter = "statuscode:200",  # include only successful captures (HTTP status code 200)
    collapse = "timestamp:8",   # limit to 1 capture per day by comparing the first 8 digits of the timestamp: <20200824>224244
    limit = 10,                 # limit the number of returned rows
    # output = "json"           # request JSON output, which includes column names
  ) %>%
  req_perform() %>%
  # pass the HTTP response string to read_table() for parsing
  resp_body_string() %>%
  read_table(col_names = c("urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"),
             col_types = cols_only(timestamp = "c",
                                   original = "c",
                                   mimetype = "c",
                                   length = "i")) %>%
  # build capture links and parse the timestamp into a date-time
  mutate(link = paste("https://web.archive.org/web", timestamp, original, sep = "/") %>% tibble::char(shorten = "front"),
         timestamp = lubridate::ymd_hms(timestamp)) %>%
  select(timestamp, link, length)
archive_links
#> # A tibble: 10 × 3
#> timestamp link length
#> <dttm> <char> <int>
#> 1 2020-08-24 22:42:44 …4224244/https://covid.cdc.gov/covid-data-tracker/ 5342
#> 2 2020-08-25 01:33:47 …5013347/https://covid.cdc.gov/covid-data-tracker/ 5370
#> 3 2020-08-26 02:37:09 …6023709/https://covid.cdc.gov/covid-data-tracker/ 5371
#> 4 2020-08-27 01:05:48 …7010548/https://covid.cdc.gov/covid-data-tracker/ 5703
#> 5 2020-08-28 02:23:26 …8022326/https://covid.cdc.gov/covid-data-tracker/ 31177
#> 6 2020-08-29 02:01:27 …9020127/https://covid.cdc.gov/covid-data-tracker/ 31237
#> 7 2020-08-30 00:06:31 …0000631/https://covid.cdc.gov/covid-data-tracker/ 31218
#> 8 2020-08-31 00:18:29 …1001829/https://covid.cdc.gov/covid-data-tracker/ 31640
#> 9 2020-09-01 02:30:30 …1023030/https://covid.cdc.gov/covid-data-tracker/ 31257
#> 10 2020-09-02 04:08:31 …2040831/https://covid.cdc.gov/covid-data-tracker/ 31654
# first capture:
archive_links$link[1]
#> <pillar_char[1]>
#> [1] https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/
Created on 2023-07-02 with reprex v2.0.2
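If the commented-out output = "json" parameter is enabled instead, the response should be a JSON array of arrays whose first row holds the column names. A rough sketch of turning that into a data frame with jsonlite follows; sample_json is a hand-written stand-in built from the rows shown earlier (an assumption about the response shape, not a captured API response), so the snippet runs offline.

```r
library(jsonlite)

# Hand-written sample mimicking an output=json response;
# values are copied from the tabular output earlier in this answer
sample_json <- '[
  ["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
  ["gov,cdc,covid)/covid-data-tracker","20200824224244","https://covid.cdc.gov/covid-data-tracker/","text/html","200","APS6SXNXBXCJU3P4N23WH4XCVDVZQYAD","5342"],
  ["gov,cdc,covid)/covid-data-tracker","20200825013347","https://covid.cdc.gov/covid-data-tracker/","text/html","200","XFEMFRGXIPWM4K5F6CBIYDSOFIGCUBQZ","5370"]
]'

# An array of equal-length string arrays is simplified to a character matrix
m <- fromJSON(sample_json)

# The first row is the header, the remaining rows are captures
captures_json <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(captures_json) <- m[1, ]

# Everything arrives as character; convert the columns that need it
captures_json$length <- as.integer(captures_json$length)
```

With a live request, the same steps would apply to the string returned by resp_body_string(), with no need to spell out the column names by hand.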
There are also Archive.org client libraries for R, e.g. https://github.com/liserman/archiveRetriever and https://hrbrmstr.github.io/wayback/, though the query interface of the first is a bit odd, and the second is currently not available through CRAN.