
I'm attempting to download a png image from a secure site through R.

To access the secure site I used rvest, which worked well.

So far I've extracted the URL for the png image.

How can I download the image at this link using rvest?

Functions outside of rvest return errors because they don't have permission to access the site.

Current attempts

library(rvest)
library(httr)  # user_agent() comes from httr

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://url.png", user_agent(uastring))
form <- html_form(session)[[1]]
form <- set_values(form, username = "***", password = "***", cookie_checkbox = TRUE)
session <- submit_form(session, form)
session2 <- jump_to(session, "https://url.png")

## Status 200 using rvest, successfully accessed page.
session 
<session> https://url.png
  Status: 200
  Type:   image/png
  Size:   438935

## Using download.file returns status 403, page unable to open.
download.file("https://url.png", destfile = "t.png")
    cannot open: HTTP status was '403 Forbidden'

I have tried readPNG and download.file on the URL, both of which failed because they don't have permission to download from an authenticated secure site (error 403), which is why I used rvest in the first place.

G. Gip
  • Please make a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Scott Mar 24 '16 at 15:21
  • You might find [this](http://stackoverflow.com/questions/29110903/how-to-download-and-display-an-image-from-an-url-in-r) helpful – Scott Mar 24 '16 at 15:24
  • The problem he's having isn't with downloading a file, it's with the authentication part. He may have to use `httr::GET` with a cookie or other authentication mechanism; a sketch of that idea follows this comment thread. – cory Mar 24 '16 at 15:54
  • @cory, it is indeed an authentication issue. I've used rvest to access the site successfully, however functions outside of Rvest still fail to access the site and download the PNG – G. Gip Mar 24 '16 at 16:10
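
A minimal sketch of cory's suggestion, assuming the site authenticates via a form POST that sets a session cookie; the endpoint, field names, and URLs are placeholders, not the OP's site:

library(httr)

# Hypothetical login endpoint with placeholder credentials
login <- POST("https://example.com/login",
              body = list(username = "***", password = "***"),
              encode = "form")
stop_for_status(login)

# httr reuses one handle per hostname, so the cookie set by the login
# response is sent automatically with this follow-up request
img <- GET("https://example.com/secure/image.png")
stop_for_status(img)

writeBin(content(img, "raw"), "image.png")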

3 Answers


Here's one example to download the R logo into the current directory.

library(rvest)
url <- "https://www.r-project.org"
imgsrc <- read_html(url) %>%
  html_node(xpath = '//*/img') %>%
  html_attr('src')
imgsrc
# [1] "/Rlogo.png"

# side-effect!
download.file(paste0(url, imgsrc), destfile = basename(imgsrc))
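
One caveat: paste0(url, imgsrc) works here only because the src happens to be root-relative. A slightly more robust sketch resolves any relative src against the page URL with xml2::url_absolute() (xml2 is the package rvest builds on):

# Resolve a relative src ("/Rlogo.png", "../img/x.png", ...) against the page URL
download.file(xml2::url_absolute(imgsrc, url), destfile = basename(imgsrc))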

EDIT

Since authentication is involved, Austin's suggestion of using a session is certainly required. Try this:

library(rvest)
library(httr)
sess <- html_session(url)
imgsrc <- sess %>%
  read_html() %>%
  html_node(xpath = '//*/img') %>%
  html_attr('src')
img <- jump_to(sess, paste0(url, imgsrc))

# side-effect!
writeBin(img$response$content, basename(imgsrc))
r2evans
  • I've already attempted using download.file with the URL directly, the issue is that it returns `error:403 forbidden` and is not able to open the link. Hence why I needed to use rvest initially to get a response 200. – G. Gip Mar 24 '16 at 15:52
  • Your post said that `download.file` wasn't working because it was a secure site. My example is accessing a secure site and works. Is there authentication involved, or is it just session-tracking or url-referer problems? – r2evans Mar 24 '16 at 15:54
  • Apologies, I meant there is authentication involved (as well as being secure). Using rvest I've managed to submit the authentication and got a status 200, but when trying to use functions outside of the R package to download it returns error 403. – G. Gip Mar 24 '16 at 15:58
  • @G.Gip, I am inferring that this isn't working for you based on your question edit. However, your edit is still using `download.file`, which my second method changed. Your image is stored in your `session` variable (if the form submission returns just an image) or `session2` (otherwise), so you should be able to use `session$response$content` as I demonstrated in my answer. If this doesn't work, perhaps you can tell us what happens when you do it? (Regardless, stop using `download.file`, it obviously won't meet your needs.) – r2evans Mar 24 '16 at 17:06
  • It worked, brilliant response! Do you know if it would possible to pass multiple sessions / PNG urls to writeBin()? My idea of using R was to use it as a batch downloader of the images. – G. Gip Mar 24 '16 at 17:22
  • Because `writeBin` is taking the full body of the response, you would need to do multiple calls to `jump_to(session, ...)` with a `writeBin` for each. It may be feasible with pipelining, but that may be beyond what `rvest`/`httr` can do (and certainly beyond my knowledge at the moment). You might be able to reuse the session from the first call and make direct (non-form) connections for the others, assuming they are all behind the same auth. In that case, it should be relatively trivial to automate it ("batch" is up for interpretation); see the sketch after this thread. – r2evans Mar 24 '16 at 17:25
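
A minimal sketch of that loop, assuming the already-authenticated session object (sess from the answer above) and a hypothetical character vector img_urls of image links behind the same auth:

img_urls <- c("https://site/img1.png", "https://site/img2.png")  # placeholder links

for (u in img_urls) {
  resp <- jump_to(sess, u)                      # reuse the authenticated session
  writeBin(resp$response$content, basename(u))  # write each response body to disk
}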

Try the example below:

library(rvest); library(dplyr)

url <- "http://www.calacademy.org/explore-science/new-discoveries-an-alaskan-butterfly-a-spider-physicist-and-more"
webpage <- html_session(url)
link.titles <- webpage %>% html_nodes("img")

img.url <- link.titles[13] %>% html_attr("src")  # the 13th <img> on this page is the desired picture

download.file(img.url, "test.jpg", mode = "wb")

You now have "test.jpg", which is the downloaded picture.

Dharman
  • Hi Austin. I've tried a similar method, unfortunately `download.file()` returns error 403: forbidden as authentication is required, hence why I required rvest to access the page. – G. Gip Mar 24 '16 at 16:16
    `rvest` imports `%>%`, so you don't need `dplyr` for this answer. Other than showing the jpeg itself, was this supposed to do anything differently than my answer? – r2evans Mar 24 '16 at 17:47

This works with several queries, renames each file, and records the source link in a text file. For queries containing spaces, put a + between the words, as in the example below.

library(rvest)
library(magrittr)
library(httr)

search_and_download_images <- function(query, size = "medium", n_images = 2, output_directory = "downloaded_images") {
  # Map the size argument onto Google's tbs=isz: codes (l = large, m = medium, i = icon)
  size_code <- switch(size, large = "l", medium = "m", icon = "i", "m")

  # Prepare the query for Google Images search
  search_url <- paste0("https://www.google.com/search?q=", query, "&tbm=isch&tbs=isz:", size_code)
  
  # Scrape image URLs from Google Images
  image_links <- search_url %>%
    read_html() %>%
    html_nodes("img") %>%
    html_attr("src") %>%
    na.omit()
  
  # Keep the desired number of image URLs
  image_links <- image_links[1:min(n_images, length(image_links))]
  
  # Create the output directory if it doesn't exist
  if (!dir.exists(output_directory)) {
    dir.create(output_directory)
  }
  
  # Download and save images; start at 2 because the first <img> on the
  # results page is typically Google's own logo rather than a search result
  for (i in 2:length(image_links)) {
    img_url <- image_links[i]
    response <- GET(img_url)
    
    if (response$status_code == 200) {
      file_ext <- tools::file_ext(img_url)
      if (file_ext == "") {
        file_ext <- "jpg"
      }
      
      # Save the image
      img_filename <- file.path(output_directory, paste0(query, "_", i, ".", file_ext))
      writeBin(content(response, "raw"), img_filename)
      
      # Save image information (URL, local filename)
      info_filename <- file.path(output_directory, paste0(query, "_", i, "_info.txt"))
      cat(paste("URL:", img_url, "\n"), file = info_filename)
      cat(paste("Local file:", img_filename, "\n"), file = info_filename, append = TRUE)
    } else {
      cat(paste("Failed to download image", i, "for query", query, "\n"))
    }
  }
}


# research_terms <- c("Water erosion", "wind erosion")
research_terms <- c("Water+erosion", "wind+erosion")
desired_size <- "medium"
number_of_images <- 10

for (term in research_terms) {
  search_and_download_images(term, size = desired_size, n_images = number_of_images)
}