
I wish to download an online folder to my Dell laptop running Windows 10. In this example the folder I wish to download is named Targetfolder. I am trying to use the Command Prompt, but I am also wondering whether there is a simple solution in R. I have included an image at the bottom of this post showing the target folder. I should add that Targetfolder contains a file and multiple subfolders containing files, and not all files have the same extension. Also, please note this is a hypothetical site; I did not want to include the real site for privacy reasons.

EDIT

Here is a real site that can serve as a functional, reproducible example. The folder rel2020 can take the place of the hypothetical Targetfolder:

https://www2.census.gov/geo/docs/maps-data/data/rel2020/

None of the answers here seem to work with Targetfolder:

How to download HTTP directory with all files and sub-directories as they appear on the online files/folders list?

Below are my attempts based on answers posted at the link above and the result I obtained:

Attempt One

lftp -c 'mirror --parallel=300 https://www.examplengo.org/datadisk/examplefolder/userdirs/user3/Targetfolder/ ;exit'

Returned:

lftp is not recognized as an internal or external command, operable program or batch file.

Attempt Two

wget -r -np -nH --cut-dirs=3 -R index.html https://www.examplengo.org/datadisk/examplefolder/userdirs/user3/Targetfolder/

Returned:

wget is not recognized as an internal or external command, operable program or batch file.

Attempt Three

https://sourceforge.net/projects/visualwget/files/latest/download

VisualWget returned Unsupported scheme next to the url.

[Image: screenshot of the Targetfolder contents]

  • Try `httr::HEAD("https://www.examplengo.org")` and see whether you have a bad https address. – Rui Barradas May 29 '22 at 19:46
  • That is an R instruction, to be run at an R prompt; the question is tagged R. You must first install package `httr` with `install.packages("httr")`, then, still at an R prompt, run the instruction above. – Rui Barradas May 29 '22 at 22:34
  • @RuiBarradas Thank you. When I do that I get the following five lines: `Response [https://www.examplengo.org/login.php] Date: 2022-05-29 23:46 Status: 200 Content-Type: text/html ` – Mark Miller May 29 '22 at 23:50
  • That site doesn't seem to be available in my country; I'm getting `Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: www.examplengo.org`. And in the browser it's also not available, so it's not a package `httr` error. – Rui Barradas May 30 '22 at 03:09
  • @RuiBarradas I should have stated in my post that the site is hypothetical. It does not exist anywhere. I did not want to post the real site for privacy reasons. However, I have now added a real site to the post: https://www2.census.gov/geo/docs/maps-data/data/rel2020/ where `rel2020` takes the place of my hypothetical `Targetfolder`. – Mark Miller May 30 '22 at 05:01
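
To make the check suggested in the comments reproducible against the real site, here is a minimal sketch; the expectation of a 200 status is an assumption based on the comment output above:

suppressPackageStartupMessages(library(httr))

# HEAD request: fetch only the response headers, not the body.
resp <- HEAD("https://www2.census.gov/geo/docs/maps-data/data/rel2020/")
status_code(resp)  # 200 indicates the address is reachable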

1 Answer


Here is a way with the packages httr and rvest.
First, from the link, get the folders where the files are.
Then loop through the folders with Map, getting the filenames and downloading them in an inner lapply loop.
If errors such as time-outs occur, they are trapped by tryCatch. The last lines of code report whether and where errors occurred.

Note: I only downloaded from `folders[1:2]`; in the `Map` below, change this to `folders` to process every folder.

suppressPackageStartupMessages({
  library(httr)
  library(rvest)
  library(dplyr)
})

link <- "https://www2.census.gov/geo/docs/maps-data/data/rel2020/"

# Read the directory-listing page once.
page <- read_html(link)

# Collect the subfolder links. The indices 8:14 are specific to this
# page's listing; adjust them if the layout changes.
folders <- page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  .[8:14] %>%
  paste0(link, .)

# Loop through the folders, scrape the .txt filenames in each, and
# download them one by one. tryCatch traps failures (e.g. time-outs)
# so one bad file does not stop the rest.
files_txt <- Map(\(x) {
  x %>%
    read_html() %>%
    html_elements("a") %>%
    html_attr("href") %>%
    grep("\\.txt$", ., value = TRUE) %>%
    paste0(x, .) %>%
    lapply(\(y) {
      tryCatch(
        # The destination folder must already exist; change "~/Temp" to taste.
        download.file(y, destfile = file.path("~/Temp", basename(y))),
        error = function(e) e
      )
    })
}, folders[1:2])

# Flag which downloads errored and print the trapped error messages.
err <- sapply(unlist(files_txt, recursive = FALSE), inherits, "error")
lapply(unlist(files_txt, recursive = FALSE)[err], conditionMessage)
  • Thank you. I had no idea the solution would be so complex. I suspect I will have to change the `grep` statement because my actual target folder contains files with multiple extensions. There also is at least one `.txt` file that is not in any subfolder. But this code is very nice. – Mark Miller May 30 '22 at 14:21
  • @MarkMiller The main problem is not the 1st pipe, where the folder names are determined; it's the 2nd. In each folder there are many files and `html_attr` returns them all, so they must be downloaded in an inner loop, `lapply`. `grep`, `paste0` and `file.path` just put the filenames together; the important part is to get those names. – Rui Barradas May 30 '22 at 14:31
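
Following up on the multiple-extensions point in the comments above, here is a minimal sketch of how the `grep` pattern could be broadened. The helper name `list_folder_files` and the extension set are illustrative assumptions, not part of the original answer:

suppressPackageStartupMessages(library(rvest))

# Hypothetical helper: list all files in one folder whose names match a
# set of extensions (the set shown is a placeholder; adjust as needed).
list_folder_files <- function(folder, pattern = "\\.(txt|csv|pdf)$") {
  folder %>%
    read_html() %>%
    html_elements("a") %>%
    html_attr("href") %>%
    grep(pattern, ., value = TRUE) %>%
    paste0(folder, .)
}

# Example, reusing the `folders` vector from the answer:
# list_folder_files(folders[1])

Running the same helper on `link` itself would also pick up any files that sit at the top level rather than inside a subfolder.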