The following code works:

library(rvest)
library(plyr)

alaska <- c(1:49)

for (i in alaska) {

  url <- "http://www.50states.com/facts/alaska.htm"

  nodespath <- paste('//*[@id="content"]/div[1]/div[4]/ol/li[',i,']')

  alaskafacts <-  data.frame(facts =  url %>%   html() %>% 
                  html_nodes(xpath =nodespath) %>%   html_text())

 alaskafacts$nm <- i
 alaskafacts$facts <- alaskafacts$facts

 result <- rbind.fill(result,alaskafacts)
}

I'll get this as a result:

[screenshot of the resulting data frame; image not preserved]

I know the loop is working because if I change the code to this:

alaska <- c(1:48)

I'll get this as a result:

[screenshot of the resulting data frame; image not preserved]

The problem I'm running into is that the loop overwrites its own output. I'm expecting 49 rows of facts, but I'm guessing each iteration erases the previous fact and then writes a new one, so the data.frame only ever holds the last fact.

I found an example in "How can I use a loop to scrape website data for multiple webpages in R?", and the code posted above follows the code in that example. I then referenced a second example, and I think the code above follows it as well.

The rbind call I have at the bottom follows the two similar examples I found on SO, yet it does not accumulate the rows as expected.

Any suggestions?


jpf5046

1 Answer


You need to define the result variable before the for loop. Currently, result is overwritten on each pass through the loop. Try this:

library(rvest)
library(plyr)

alaska <- c(1:49)
result <- data.frame()  # create the empty accumulator once, before the loop
for (i in alaska) {

  url <- "http://www.50states.com/facts/alaska.htm"
....
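
To see the accumulation pattern on its own, here is a minimal sketch with no scraping involved. The facts vector is made-up stand-in data, and base rbind() is used in place of rbind.fill() so that no packages are needed; both substitutions are assumptions for illustration only.

```r
# Stand-in data (made up); in the real loop each value would come from html_text().
facts <- c("fact one", "fact two", "fact three")

# Create the accumulator once, before the loop starts.
result <- data.frame()

for (i in seq_along(facts)) {
  # One row per iteration, mirroring the nm/facts columns used in the question.
  row <- data.frame(nm = i, facts = facts[i], stringsAsFactors = FALSE)

  # Append this iteration's row to everything collected so far.
  result <- rbind(result, row)
}

nrow(result)  # 3
```

Because result already exists when the loop starts, every iteration appends to it rather than depending on whatever happens to be in the workspace.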

There is a faster way to pull the requested information without using a for loop (and without needing to know the required length beforehand). rvest is vectorized, so all of the nodes can be pulled in one statement:

library(rvest)

url <- "http://www.50states.com/facts/alaska.htm"
page <- url %>% read_html()

resultsarray <- html_text(html_nodes(page, "ol.stripedList li"))
  # "ol.stripedList li" is a CSS selector: it matches every li (list item)
  # nested under an ol (ordered list) of class "stripedList"

resultsarray is a character vector containing the 49 facts; I will leave it to you to convert it to the desired data frame.
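
One possible way to do that conversion is sketched below. The three strings are made-up stand-ins for the scraped facts, and the nm and facts column names are borrowed from the question rather than required by anything.

```r
# Stand-in for the character vector returned by html_text() (made-up values).
resultsarray <- c("first fact", "second fact", "third fact")

# Build the data frame in one step: seq_along() numbers the facts 1, 2, 3, ...
alaskafacts <- data.frame(
  nm = seq_along(resultsarray),
  facts = resultsarray,
  stringsAsFactors = FALSE
)

str(alaskafacts)
```

Since the whole vector goes in at once, there is no loop and no need to know in advance how many facts the page contains.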

Dave2e
  • works perfectly. how did you know 'ol.stripedList li' is what i needed? – jpf5046 Feb 01 '17 at 01:37
  • It is a matter of looking at the HTML source to identify the blocks of interest, sometimes with a bit of trial and error. If you take a look at rvest's vignette, it explains how to use the "selectorgadget" tool, which simplifies the process. – Dave2e Feb 01 '17 at 02:52
  • would you still recommend NOT using a for loop if i wanted to run through all states, where `states <- c("all states")` and then something similar to `for (i in all states)` -- I think a loop would be necessary if that was my game problem. What do you think? – jpf5046 Feb 01 '17 at 14:25
  • In that case, I agree a loop would be easiest. There are ways to vectorize the calls to all 50 states, but loading the web pages is the rate-limiting step, so for only 50 pages the time saved by vectorizing the page loads is not worth the effort or the loss in readability. – Dave2e Feb 01 '17 at 15:05
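
A rough sketch of the multi-state loop discussed in these comments might look like the following. It combines the predefined result variable with the vectorized selector from the answer. The helper names (state_url, scrape_state, scrape_states) are hypothetical, and it is only an assumption that other states follow the same URL pattern and page layout as the Alaska page.

```r
library(rvest)

# Assumption: every state page uses the same URL pattern as the Alaska page.
state_url <- function(state) {
  paste0("http://www.50states.com/facts/", state, ".htm")
}

# Scrape one state's page and return its facts as a data frame.
# Assumption: every page keeps its facts in the same "ol.stripedList li" nodes.
scrape_state <- function(state) {
  facts <- state_url(state) %>%
    read_html() %>%
    html_nodes("ol.stripedList li") %>%
    html_text()

  data.frame(
    state = rep(state, length(facts)),
    nm = seq_along(facts),
    facts = facts,
    stringsAsFactors = FALSE
  )
}

# Loop over the states, appending each page's rows to one result.
scrape_states <- function(states) {
  result <- data.frame()  # defined before the loop, as in the answer
  for (state in states) {
    result <- rbind(result, scrape_state(state))
  }
  result
}
```

Calling scrape_states(c("alaska", "alabama")) would then request one page per state, so the loop adds little overhead compared with the page loads themselves.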