
I am working on a web scraping program to search for data across multiple pages. The code below is an example of what I am working with. I am able to get only the first page with this. It would be a great help if someone could point out where I am going wrong in my syntax.

library(rvest)

jump <- seq(1, 10, by = 1)

# one search URL per page number
site <- paste0("https://stackoverflow.com/search?page=", jump, "&tab=Relevance&q=%5bazure%5d%20free%20tier")


dflist <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, '.excerpt')
  draft <- html_text(draft_table)
})



finaldf <- do.call(cbind, dflist)

finaldf_10 <- data.frame(finaldf)

View(finaldf_10)

Below is the link from which I need to scrape the data; the search results run to 127 pages.

https://stackoverflow.com/search?q=%5Bazure%5D+free+tier

With the above code I am able to get data only from the first page and not from the rest of the pages. There is no syntax error either. Could you please help me find out where I am going wrong?

Tanuvi
  • Don't you need to use `do.call(rbind, dflist)` instead of `do.call(cbind, dflist)`? (An illustration of the difference follows after these comments.) Furthermore, it is always good to include a description of what is going wrong (according to you) and include possible error messages or incorrect output. – Jaap Jul 19 '17 at 06:25
  • An example of a similar problem: https://stackoverflow.com/questions/40525661/how-to-scrape-mutiple-tables-indexing-both-yearpage – Jaap Jul 19 '17 at 06:27
  • @Jaap The problem is in the `dflist` loop – Tanuvi Jul 19 '17 at 06:27
  • I think you need to add an extra line in `dflist`: `return(draft)` (or just `draft`) – Jaap Jul 19 '17 at 06:32
  • I have added `return(draft)` as you suggested. Please see the revised code below, but I'm still not getting the output. Please suggest. `dflist <- lapply(site, function(i) { webpage <- read_html(i); draft_table <- html_nodes(webpage, '.excerpt'); draft <- html_text(draft_table); return(draft) })` – Tanuvi Jul 19 '17 at 06:44
  • Please read [ask] and include a description of what you expect and what is going wrong. That will make it a lot easier for others to help you. – Jaap Jul 19 '17 at 06:50
  • Sure Jaap. I have edited my query. – Tanuvi Jul 19 '17 at 07:09
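
To make Jaap's rbind/cbind comment concrete, here is a small illustration (the excerpt values are made up): each page yields one vector of excerpts, and stacking the per-page data frames as rows keeps every excerpt, while cbind pairs the vectors element-wise and recycles the shorter one when pages return different numbers of results.

page1 <- c("excerpt A", "excerpt B", "excerpt C")  # hypothetical results from page 1
page2 <- c("excerpt D", "excerpt E")               # hypothetical results from page 2
dflist <- list(data.frame(text = page1), data.frame(text = page2))

do.call(rbind, dflist)               # stacks as rows: all 5 excerpts in one column
# do.call(cbind, list(page1, page2)) # pairs element-wise; recycles page2 with a warning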

1 Answer

Some websites put security measures in place to prevent bulk scraping. I guess SO is one of them. More on that: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md

In fact, if you delay your calls a little, this will work. I've tried with a 5-second Sys.sleep(). I guess you can reduce it, but that may not work (I tried with a 1-second Sys.sleep(), and that didn't work).

Here is working code:

library(rvest)
library(purrr)

dflist <- map(.x = 1:10, .f = function(x) {
  Sys.sleep(5)  # pause between requests so SO doesn't block them
  url <- paste0("https://stackoverflow.com/search?page=", x, "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)  # stack the per-page data frames as rows
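
For the full 127 pages mentioned in the question, one possible extension of this code (a sketch, assuming the same '.excerpt' selector and the 5-second delay hold on every page; possibly() from purrr is used here so a single failed request doesn't abort the whole run):

scrape_page <- function(x) {
  Sys.sleep(5)  # keep the delay so SO doesn't block the requests
  url <- paste0("https://stackoverflow.com/search?page=", x, "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}

# possibly() returns NULL instead of throwing an error when a request fails
safe_scrape <- possibly(scrape_page, otherwise = NULL)

finaldf <- map(1:127, safe_scrape) %>%
  compact() %>%          # drop the pages that failed
  do.call(rbind, .)

At 5 seconds per page this takes roughly ten minutes for 127 pages, so reducing the delay is tempting, but as noted above a 1-second delay was already too short.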

Best,

Colin

Colin FAY