
I would like to use a loop to scrape data from multiple webpages in R. I can scrape the data for one webpage, but when I try to loop over multiple pages I get an error. I have spent hours tinkering, to no avail. Any help would be greatly appreciated!

This works:

###########################
# GET COUNTRY DATA
###########################

library("rvest")

site <- paste("http://www.countryreports.org/country/","Norway",".htm", sep="")
site <- html(site)

stats<-
    data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() ,
         facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() ,
         stringsAsFactors=FALSE)

stats$country <- "Norway"
stats$names   <- gsub('[\r\n\t]', '', stats$names)
stats$facts   <- gsub('[\r\n\t]', '', stats$facts)
View(stats)

However, when I put this in a loop, I get an error:

###########################
# ATTEMPT IN A LOOP
###########################

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain")

for(i in country){

site <- paste("http://www.countryreports.org/country/",country,".htm", sep="")
site <- html(site)

stats<-
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() ,
         facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() ,
       stringsAsFactors=FALSE)

stats$country <- country
stats$names   <- gsub('[\r\n\t]', '', stats$names)
stats$facts   <- gsub('[\r\n\t]', '', stats$facts)

stats<-rbind(stats,stats)
stats<-stats[!duplicated(stats),]
}

Error:

Error: length(url) == 1 is not TRUE
In addition: Warning message:
In if (grepl("^http", x)) { :
  the condition has length > 1 and only the first element will be used
Chris L
  • Same result here. I tried this code and got the same error message, even on the non-loop version that worked! `length(site)` returns `[1] 7`, and `stopifnot(length(site) == 1)` gives `Error: length(site) == 1 is not TRUE`. – lawyeR Jan 08 '15 at 04:26
  • On this line: `site <- paste("http://www.countryreports.org/country/",country,".htm", sep="")` you are using `country`, which is, in the loop version, a character vector with all your countries. You probably want `i`, which is one element of your country vector. – zelite Jan 08 '15 at 09:17
  • zelite - that got me a lot closer - thank you. – Chris L Jan 09 '15 at 02:57
  • Thanks to both of you for the help. I'll add the final working code for reference - hope it helps someone! – Chris L Jan 09 '15 at 03:12
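
To make zelite's point concrete, here is a minimal sketch (it needs no network access) comparing the URL built from the whole `country` vector with the one built from the loop variable `i`. The names `urls_from_vector`, `url_from_i`, and `url_lengths` are only illustrative and do not appear in the original code.

```r
country <- c("Norway", "Sweden", "Finland", "France", "Greece", "Italy", "Spain")

# paste() is vectorised: given the whole vector it returns one URL per country.
urls_from_vector <- paste("http://www.countryreports.org/country/", country, ".htm", sep = "")
length(urls_from_vector)

# Inside the loop, `i` holds a single country, so each iteration builds one URL.
url_lengths <- integer(0)
for (i in country) {
  url_from_i <- paste("http://www.countryreports.org/country/", i, ".htm", sep = "")
  url_lengths <- c(url_lengths, length(url_from_i))
}
url_lengths
```

Passing the multi-element vector to `html()` is what triggers `length(url) == 1 is not TRUE`; the single string built from `i` avoids it.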

3 Answers


Final working code:

###########################
# THIS WORKS!!!!
###########################

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain")

for(i in country){

site <- paste("http://www.countryreports.org/country/",i,".htm", sep="")
site <- html(site)

stats<-
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() ,
     facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() ,
       stringsAsFactors=FALSE)

stats$nm <- i
stats$names   <- gsub('[\r\n\t]', '', stats$names)
stats$facts   <- gsub('[\r\n\t]', '', stats$facts)
#stats<-stats[!duplicated(stats),]
all<-rbind(all,stats)

}
 View(all)
Chris L
  • Does this really work for you? I am aiming to do something similar, so I ran your code and received the following error: `Error in rep(xi, length.out = nvar) : attempt to replicate an object of type 'builtin'`. Did you initialize `all` somewhere beforehand? – Z_D May 24 '15 at 20:53
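
A likely explanation for the error Z_D reports: in a fresh session `all` is a base R function, not a data frame, so `rbind(all, stats)` has nothing sensible to bind to. The sketch below checks this and shows one possible fix, starting the accumulator at `NULL`. The name `all_stats` and the placeholder rows are assumptions made for illustration; no scraping is performed.

```r
# Without a prior assignment, `all` refers to the base R function.
is.function(all)
typeof(all)

# Starting from NULL lets rbind() grow the result one country at a time.
all_stats <- NULL
for (i in c("Norway", "Sweden")) {
  # Placeholder for the scraped `stats` data frame of one country.
  stats <- data.frame(names = "Capital", facts = "placeholder", nm = i,
                      stringsAsFactors = FALSE)
  all_stats <- rbind(all_stats, stats)
}
nrow(all_stats)
```

Using a name other than `all` also avoids masking the base function.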

Just initialize an empty data frame before the loop. I ran into the same problem, and the following code works fine for me.

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain")
df <- data.frame(names = character(0), facts = character(0), nm = character(0), stringsAsFactors = FALSE)

for(i in country){

  site <- paste("http://www.countryreports.org/country/",i,".htm", sep="")
  site <- html(site)

  stats<-
    data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() ,
               facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() ,
               stringsAsFactors=FALSE)

  stats$nm <- i
  stats$names   <- gsub('[\r\n\t]', '', stats$names)
  stats$facts   <- gsub('[\r\n\t]', '', stats$facts)
  # Append this country's rows to the accumulated data frame.
  df <- rbind(df, stats)

}
View(df)
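
As an alternative to growing `df` inside the loop, the per-country results can be collected in a list and bound once at the end. This is only a sketch of that pattern: `fake_scrape` is a hypothetical stand-in for the scraping step and returns placeholder rows, so the example runs without network access.

```r
# Hypothetical stand-in for the scraping step. A real version would run the
# rvest code from the loop body and return the cleaned `stats` data frame.
fake_scrape <- function(i) {
  data.frame(names = c("Capital", "Currency"),
             facts = c("placeholder", "placeholder"),
             nm = i,
             stringsAsFactors = FALSE)
}

country <- c("Norway", "Sweden", "Finland")

# Build one data frame per country, then combine them with a single rbind().
pieces <- lapply(country, fake_scrape)
df <- do.call(rbind, pieces)
nrow(df)
```

With this approach no empty data frame needs to be initialized beforehand, since `do.call(rbind, pieces)` builds the result directly from the list.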
Premal

This is what I did. It is not the best solution, only a workaround, but it does produce output. In general I do not recommend writing table output to a file from inside a loop. After `stats` has been generated inside the loop, add:

output<-rbind(stats,i)

and then append the table to a file:

write.table(output, file = "D:\\Documents\\HTML\\Test of loop.csv", row.names = FALSE, append = TRUE, sep = ",")

#then close the loop
}
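
If you do append to a file inside the loop, one thing to watch is the header row: with `append = TRUE` and column names left on, `write.table()` writes the header again on every iteration. Below is a minimal sketch that writes the header only on the first pass. It uses a temporary file and placeholder rows instead of scraped data; `out_file` and `first_write` are illustrative names, not part of the original answer.

```r
# Temporary output path; tempfile() only returns a name, it creates no file.
out_file <- tempfile(fileext = ".csv")

for (i in c("Norway", "Sweden")) {
  # Placeholder for the scraped `stats` data frame of one country.
  stats <- data.frame(names = c("Capital", "Currency"),
                      facts = c("placeholder", "placeholder"),
                      nm = i,
                      stringsAsFactors = FALSE)

  # Write the header only when the file does not exist yet, then append.
  first_write <- !file.exists(out_file)
  write.table(stats, file = out_file, sep = ",", row.names = FALSE,
              col.names = first_write, append = !first_write)
}

# Read the file back to check that it contains one header and all rows.
result <- read.csv(out_file, stringsAsFactors = FALSE)
nrow(result)
```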

Good luck

Abdulla Nilam
SKD