
Consider a web page with many links for downloading data.


I would like to select the links for the "r" data format. The goal is to isolate them from the source code of the page (after I have logged in).

conn <- url("http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/35536?searchSource=find-analyze-home&sortBy=&q=GSS")
html_code <- readLines(conn)
close(conn)
html_code

html_code ends up holding thousands of apparently isolated lines of HTML that are not visible in the R console, even though the data is downloaded correctly. That is, if I copy the apparently empty console output into a text editor, the HTML code is visible. Because of that, I have a hard time trying to identify the information I need.

How can I better visualize the downloaded data?

Worice
  • This may be useful: http://stackoverflow.com/questions/1844829/how-can-i-read-and-parse-the-contents-of-a-webpage-in-r – Bram Vanroy Jan 21 '16 at 16:06

1 Answer

One solution is to leverage the rvest package:

# install.packages("rvest")
library(rvest)

page <- read_html("http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/35536?searchSource=find-analyze-home&sortBy=&q=GSS")

# grab all of the links
links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# find the links that contain 'rdata'
contains_rdata <- grep("rdata", links)
links[contains_rdata]
# [1] "http://www.icpsr.umich.edu/cgi-bin/bob/terms2?study=35536&ds=&bundle=rdata&path=ICPSR" 
# [2] "http://www.icpsr.umich.edu/cgi-bin/bob/terms2?study=35536&ds=1&bundle=rdata&path=ICPSR"
# [3] "http://www.icpsr.umich.edu/cgi-bin/bob/terms2?study=35536&ds=2&bundle=rdata&path=ICPSR"
# [4] "http://www.icpsr.umich.edu/cgi-bin/bob/terms2?study=35536&ds=3&bundle=rdata&path=ICPSR"
# [5] "http://www.icpsr.umich.edu/cgi-bin/bob/terms2?study=35536&ds=4&bundle=rdata&path=ICPSR"

As pointed out by @hrbrmstr, a more robust and streamlined solution is to target only the anchor tags with R data links:

page %>%
  html_nodes("a[data-package = 'r']") %>%
  html_attr("href")

If you're not a fan of chaining, you can use:

html_attr(html_nodes(page, "a[data-package='r']"), "href")
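
If you would rather keep working with the readLines() output from the question, the same filtering can be roughly sketched in base R. This is only an illustration: the html_code vector below is a made-up stand-in with example.org links (not the real ICPSR page), and pulling href values out with a regular expression is an assumption that breaks easily on real-world HTML, so the rvest selectors above remain the safer route.

```r
# Hypothetical stand-in for the character vector returned by readLines()
html_code <- c(
  "<html><body>",
  "  <a data-package='r' href='http://example.org/terms?ds=1&bundle=rdata'>R</a>",
  "  <a data-package='sas' href='http://example.org/terms?ds=1&bundle=sas'>SAS</a>",
  "  <a data-package='r' href='http://example.org/terms?ds=2&bundle=rdata'>R</a>",
  "</body></html>"
)

# Keep only the lines that mention 'rdata'
rdata_lines <- grep("rdata", html_code, value = TRUE, fixed = TRUE)

# Extract the first href="..." (or href='...') attribute from each kept line
href_pattern <- "href=['\"][^'\"]+['\"]"
hrefs <- regmatches(rdata_lines, regexpr(href_pattern, rdata_lines))

# Strip the leading href= and the surrounding quotes
rdata_links <- gsub("^href=['\"]|['\"]$", "", hrefs)

# Print one link per line, which is easier to read than the raw vector
writeLines(rdata_links)
```

Printing with writeLines() (or inspecting a slice such as head(html_code, 20)) shows each element on its own line, which makes long HTML output easier to scan than the default print of the whole vector.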

JasonAizkalns