Scraping data from a site with multiple urls

Question

I've been trying to scrape a list of companies from this site: Company list401.html. I can scrape the single table on this page with this code:

library(rvest)
fileurl = read_html("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/2005/1")
content = fileurl %>%
  html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
  html_table()
contentframe = data.frame(content)
View(contentframe)

However, I need the data for every year from 1955 through 2005, and for companies 1 through 500, whereas each page only shows 100 companies for a single year. I've noticed that the only parts of the url that change are "...fortune500_archive/full/" YEAR "/" followed by 1, 101, 201, 301, or 401 (the first rank in the range of companies shown).

I also understand that I have to create a loop that will automatically collect this data for me as opposed to me manually replacing the url after saving each table. I've tried a few variations of sapply functions from reading other posts and watching videos, but none will work for me and I'm lost.

This is one situation where an old fashioned for loop is perfectly acceptable. — cory, Aug 10 '16 at 19:33
You could start here http://stackoverflow.com/q/5963269/2824732 — Robert, Aug 10 '16 at 19:33
The web query is the time limiting step here. In this case I would just use a FOR loop. — Dave2e, Aug 10 '16 at 20:12
Welcome to your violation of ToS item #7: https://subscription.timeinc.com/storefront/privacy/fortune/privacy_terms_service.html (and your encouragement for others to also violate said ToS). — hrbrmstr, Aug 10 '16 at 20:45

score 0 · Accepted Answer · answered Aug 10 '16 at 20:26

A few suggestions to get you started. First, it may be useful to write a function to download and parse each page, e.g.

library(rvest)

getData <- function(year, start) {
  # Build the page url from the year and the first rank shown on the page
  url <- sprintf("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html", 
    year, start)
  fileurl <- read_html(url)
  content <- fileurl %>%
    html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
    html_table()
  # Return the parsed table as a data frame
  data.frame(content)
}
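
To see which pages such a function would need to visit for the full range in the question, the year and start values can be expanded into a vector of urls first. This is only a sketch: it assumes every year from 1955 to 2005 has all five pages and that the ".html" url pattern used above holds throughout, neither of which has been checked against the site. The names grid and urls are made up for this illustration.

```r
# All year/start combinations for 1955-2005 (assumed range, from the question).
# expand.grid varies the first argument fastest, so pages are grouped by year.
grid <- expand.grid(start = seq(1, 401, by = 100), year = 1955:2005)

# Same url pattern as in getData; no request is made here.
urls <- sprintf(
  "http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html",
  grid$year, grid$start)

length(urls)   # 255 (51 years x 5 pages)
head(urls, 2)
```

Building the list up front makes it easy to check the urls by eye before sending any requests.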

We can then loop through the years and pages using lapply (as well as do.call(rbind, ...) to rbind all 5 dataframes from each year together). E.g.:

D <- lapply(2000:2005, function(year) {
  do.call(rbind, lapply(seq(1, 500, 100), function(start) {
    cat(paste("Retrieving", year, ":", start, "\n"))
    getData(year, start)
    }))
})
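
The result D is a list with one data frame per year. If a single data frame is wanted, one option is to tag each element with its year and stack them. The sketch below uses toy data in place of the scraped tables so it runs without touching the site. The column names Rank and Company, and the names D_toy and all_years, are assumptions for illustration and not the columns the site actually returns.

```r
# Toy stand-in for the list D: one small data frame per year.
years <- 2000:2002
D_toy <- lapply(years, function(year) {
  data.frame(Rank = 1:3, Company = paste("Company", 1:3),
             stringsAsFactors = FALSE)
})

# Add a Year column to each element, then stack everything into one data frame.
all_years <- do.call(rbind, Map(function(df, year) {
  df$Year <- year
  df
}, D_toy, years))

nrow(all_years)  # 9
```

With the real data, D and the vector of years passed to lapply would take the place of D_toy and years.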
