
I'm trying to write a for loop that loops through a number of websites, extracts a few elements from each, and stores the results in a table in R. Here's my attempt so far; I'm just not sure how to start the for loop, or how to collect all the results into one variable that can be exported later.

library("dplyr")
library("rvest")
library("leaflet")
library("ggmap")


url <- read_html("http://www.website_name.com/")

agent <- html_nodes(url, "h1 span")
fnames <- html_nodes(url, "#offNumber_mainLocContent span")
address <- html_nodes(url, "#locStreetContent_mainLocContent")

scrape <- t(c(html_text(agent), html_text(fnames), html_text(address)))


View(scrape)
CHopp
2 Answers


Given that your question isn't fully reproducible, here's a toy example that loops through three URLs (Red Sox, Blue Jays, and Yankees):

library(rvest)

# teams
teams <- c("BOS", "TOR", "NYY")

# init
df <- NULL

# loop
for(i in teams){
    # find url
    url <- paste0("http://www.baseball-reference.com/teams/", i, "/")
    page <- read_html(url)
    # grab table
    table <- page %>%
        html_nodes(css = "#franchise_years") %>%
        html_table() %>%
        as.data.frame()
    # bind to dataframe
    df <- rbind(df, table)
}

# view captured data
View(df)

The loop works because `paste0` substitutes each team abbreviation for `i` in turn, so a fresh URL is built on every iteration.
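
To see exactly which URLs the loop visits, the same strings can be built up front in one vectorised call (a minimal sketch; the `urls` name is just for illustration):

# paste0 recycles the prefix and suffix across the whole teams vector
urls <- paste0("http://www.baseball-reference.com/teams/", teams, "/")
urls
# [1] "http://www.baseball-reference.com/teams/BOS/"
# [2] "http://www.baseball-reference.com/teams/TOR/"
# [3] "http://www.baseball-reference.com/teams/NYY/"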

emehex

I would go with lapply.

The code would look something like this:

library("rvest")
library("dplyr")

#a vector of urls you want to scrape
URLs <- c("http://...1", "http://...2", ....)

df <- lapply(URLs, function(u){

    html.obj <- read_html(u)
    agent <- html_nodes(html.obj, "h1 span") %>% html_text()
    fnames <- html_nodes(html.obj, "#offNumber_mainLocContent span") %>% html_text()
    address <- html_nodes(html.obj, "#locStreetContent_mainLocContent") %>% html_text()

    data.frame(Agent = agent, Fnames = fnames, Address = address)
})

df <- do.call(rbind, df)

View(df)
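
If any page fails to load, the whole lapply call errors out; a tryCatch wrapper keeps the rest of the scrape going (a hedged sketch, assuming you'd rather skip failures than abort, since rbind silently drops NULL entries):

df <- lapply(URLs, function(u){
    # return NULL instead of erroring if a page can't be fetched
    html.obj <- tryCatch(read_html(u), error = function(e) NULL)
    if (is.null(html.obj)) return(NULL)

    agent <- html_nodes(html.obj, "h1 span") %>% html_text()
    fnames <- html_nodes(html.obj, "#offNumber_mainLocContent span") %>% html_text()
    address <- html_nodes(html.obj, "#locStreetContent_mainLocContent") %>% html_text()

    data.frame(Agent = agent, Fnames = fnames, Address = address)
})

# rbind ignores the NULL entries from failed pages
df <- do.call(rbind, df)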
dimitris_ps
  • Worked great! How can I adjust it to make sure the data from each scrape is stored in a separate row? Right now it's storing them all adjacent to each other – CHopp Jul 29 '16 at 19:00
  • I am not sure I understand your question. Within the data.frame of `lapply` you could have the following `data.frame(Agent=agent, Fnames=fnames, Address=address, URL=u)` to attach the corresponding URL to every line generated – dimitris_ps Jul 31 '16 at 05:39
  • I figured it out, but another question: why would I get an error like this when trying to scrape a site: "error: 'www.website.com' does not exist in current working directory"? – CHopp Aug 02 '16 at 13:26
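
That last error usually means the string was passed to read_html without a scheme: "www.website.com" is treated as a local file path rather than a web address. A minimal sketch that prepends the scheme when it is missing (the add_scheme helper is illustrative, not part of rvest):

# prepend "http://" to any entry that doesn't already start with http:// or https://
add_scheme <- function(u) ifelse(grepl("^https?://", u), u, paste0("http://", u))
URLs <- add_scheme(URLs)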