
I am working on a web scraping program to search for data across multiple pages. The code below is an example of what I am working with. I am able to get only the first page with this. It would be a great help if someone could point out where I am going wrong in my syntax.

library(rvest)

jump <- seq(1, 10, by = 1)

# one search URL per page number
site <- paste0("https://stackoverflow.com/search?page=", jump, "&tab=Relevance&q=%5bazure%5d%20free%20tier")


dflist <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, '.excerpt')
  draft <- html_text(draft_table)
})



finaldf <- do.call(cbind, dflist)

finaldf_10 <- data.frame(finaldf)

View(finaldf_10)

Below is the link from which I need to scrape the data; the search results run to 127 pages.

https://stackoverflow.com/search?q=%5Bazure%5D+free+tier

With the above code I am able to get data only from the first page and not from the rest of the pages. There is no syntax error either. Could you please help me find out where I am going wrong?

Tanuvi
  • Don't you need to use `do.call(rbind, dflist)` instead of `do.call(cbind, dflist)`? (An illustration of the difference follows after these comments.) Furthermore, it is always good to include a description of what is going wrong (according to you) and include possible error messages or incorrect output. – Jaap Jul 19 '17 at 06:25
  • An example of a similar problem: https://stackoverflow.com/questions/40525661/how-to-scrape-mutiple-tables-indexing-both-yearpage – Jaap Jul 19 '17 at 06:27
  • @Jaap The problem is in the `dflist` loop – Tanuvi Jul 19 '17 at 06:27
  • I think you need to add an extra line in `dflist`: `return(draft)` (or just `draft`) – Jaap Jul 19 '17 at 06:32
  • I have added `return(draft)` as you suggested. Please see the revised code below, but I'm still not getting the output. Please suggest. `dflist <- lapply(site, function(i) { webpage <- read_html(i); draft_table <- html_nodes(webpage, '.excerpt'); draft <- html_text(draft_table); return(draft) })` – Tanuvi Jul 19 '17 at 06:44
  • Please read [ask] and include a description of what you expect and what is going wrong. That will make it a lot easier for others to help you. – Jaap Jul 19 '17 at 06:50
  • Sure Jaap. I have edited my query. – Tanuvi Jul 19 '17 at 07:09
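
To make Jaap's rbind/cbind comment concrete, here is a small illustration (the excerpt values are made up): each page yields one vector of excerpts, and stacking the per-page data frames as rows keeps every excerpt, while cbind pairs the vectors element-wise and recycles the shorter one when pages return different numbers of results.

page1 <- c("excerpt A", "excerpt B", "excerpt C")  # hypothetical results from page 1
page2 <- c("excerpt D", "excerpt E")               # hypothetical results from page 2
dflist <- list(data.frame(text = page1), data.frame(text = page2))

do.call(rbind, dflist)               # stacks as rows: all 5 excerpts in one column
# do.call(cbind, list(page1, page2)) # pairs element-wise; recycles page2 with a warning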

1 Answer

Some websites put security measures in place to prevent bulk scraping. I guess SO is one of them. More on that: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md

In fact, if you delay your calls a little, this will work. I've tried with a 5-second Sys.sleep(). I guess you can reduce it, but that may not work (I tried with a 1-second Sys.sleep(), and that didn't work).

Here is working code:

library(rvest)
library(purrr)

dflist <- map(.x = 1:10, .f = function(x) {
  Sys.sleep(5)  # pause between requests so SO doesn't block them
  url <- paste0("https://stackoverflow.com/search?page=", x, "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)  # stack the per-page data frames as rows
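
For the full 127 pages mentioned in the question, one possible extension of this code (a sketch, assuming the same '.excerpt' selector and the 5-second delay hold on every page; possibly() from purrr is used here so a single failed request doesn't abort the whole run):

scrape_page <- function(x) {
  Sys.sleep(5)  # keep the delay so SO doesn't block the requests
  url <- paste0("https://stackoverflow.com/search?page=", x, "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}

# possibly() returns NULL instead of throwing an error when a request fails
safe_scrape <- possibly(scrape_page, otherwise = NULL)

finaldf <- map(1:127, safe_scrape) %>%
  compact() %>%          # drop the pages that failed
  do.call(rbind, .)

At 5 seconds per page this takes roughly ten minutes for 127 pages, so reducing the delay is tempting, but as noted above a 1-second delay was already too short.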

Best,

Colin

Colin FAY