I've been trying to scrape a list of companies from the Fortune 500 archive site. I can scrape the single table from one page with this code:

library(rvest)

# Read one archive page (the first 100 companies for 2005)
fileurl <- read_html("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/2005/1")
content <- fileurl %>%
  html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
  html_table()
contentframe <- data.frame(content)
View(contentframe)

However, I need the data for every year from 1955 to 2005, and for all companies ranked 1 through 500, whereas each page shows only 100 companies for a single year. I've noticed that the only parts of the URL that change are "...fortune500_archive/full/" YEAR "/" followed by 1, 101, 201, 301, or 401 (the start of the range of companies shown).

I also understand that I need a loop to collect this data automatically, rather than manually replacing the URL after saving each table. I've tried a few variations of sapply from other posts and videos, but none of them work for me and I'm lost.

  • This is one situation where an old fashioned for loop is perfectly acceptable. – cory Aug 10 '16 at 19:33
  • You could start here http://stackoverflow.com/q/5963269/2824732 – Robert Aug 10 '16 at 19:33
  • The web query is the time limiting step here. In this case I would just use a FOR loop. – Dave2e Aug 10 '16 at 20:12
  • Welcome to your violation of ToS item #7: https://subscription.timeinc.com/storefront/privacy/fortune/privacy_terms_service.html (and your encouragement for others to also violate said ToS). – hrbrmstr Aug 10 '16 at 20:45

1 Answer

A few suggestions to get you started. First, it may be useful to write a function to download and parse each page, e.g.

library(rvest)

# Download one archive page and return its table as a data frame
getData <- function(year, start) {
  url <- sprintf("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html", 
    year, start)
  fileurl <- read_html(url)
  content <- fileurl %>%
    html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
    html_table()
  data.frame(content)  # last expression is the function's return value
}
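
Before downloading anything, it may help to check which URLs the sprintf template produces. The sketch below is only an illustration: buildUrl is a hypothetical helper that repeats the template from getData without fetching the page.

```r
# Hypothetical helper (not part of rvest): builds the page URL without fetching it.
buildUrl <- function(year, start) {
  sprintf("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html",
          year, start)
}

# Start offsets for the five pages of 100 companies each
starts <- seq(1, 500, 100)
print(starts)

# URLs for every page of a single year
urls <- sapply(starts, function(start) buildUrl(2005, start))
print(urls)
```

If the printed URLs match what you see in the browser for each page, the same year and start values can be passed to getData.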

We can then loop through the years and pages using lapply, with do.call(rbind, ...) combining the five data frames from each year into one. The example covers 2000 to 2005; use 1955:2005 for the full range:

D <- lapply(2000:2005, function(year) {
  do.call(rbind, lapply(seq(1, 500, 100), function(start) {
    cat(paste("Retrieving", year, ":", start, "\n"))
    getData(year, start)
    }))
})
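
The result is a list with one data frame per year, and the rows themselves may not record which year they came from. One possible follow-up is to tag each year's rows and stack everything into a single data frame. The sketch below uses fakeGetData, a hypothetical stand-in for getData that returns dummy rows so the example runs without network access; the Rank and Company columns are assumptions, and the real table's columns may differ.

```r
# Hypothetical stand-in for getData() so the sketch runs offline.
# Column names here are assumptions, not the real table's headers.
fakeGetData <- function(year, start) {
  data.frame(Rank = start:(start + 99),
             Company = paste("Company", start:(start + 99)),
             stringsAsFactors = FALSE)
}

years <- 2004:2005
byYear <- lapply(years, function(year) {
  pages <- lapply(seq(1, 500, 100), function(start) {
    # With real requests, a pause such as Sys.sleep(1) could go here
    fakeGetData(year, start)
  })
  yearData <- do.call(rbind, pages)  # stack the five pages for this year
  yearData$Year <- year              # record which year the rows came from
  yearData
})

# Stack the per-year data frames into one
allData <- do.call(rbind, byYear)
print(dim(allData))
```

Swapping fakeGetData for getData should give the same structure with real rows, provided every page parses to a table with the same columns.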
Weihuang Wong