I am working with the R programming language.
In a previous question (R: Webscraping Pizza Shops - "read_html" not working?), I learned how to scrape the names and address of Pizza Stores from YellowPages (e.g. https://www.yellowpages.ca/search/si/2/pizza/Canada). Here is the code for how to scrape a single page:
library(tidyverse)
library(rvest)
scraper <- function(url) {
page <- url %>%
read_html()
tibble(
name = page %>%
html_elements(".jsListingName") %>%
html_text2(),
address = page %>%
html_elements(".listing__address--full") %>%
html_text2()
)
}
I then tried to make a LOOP that will repeat this for all 391 pages:
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results = list()
for (i in 1:391)
{
url_i = paste0(a,i,b)
s_i = data.frame(scraper(url_i))
ss_i = data.frame(i,s_i)
print(ss_i)
list_results[[i]] <- ss_i
}
final = do.call(rbind.data.frame, list_results)
My Problem: I noticed that after the 60th page, I get the following error:
Error in data.frame(i, s_i) :
arguments imply differing number of rows: 1, 0
In addition: Warning message:
In for (i in seq_along(specs)) { :
closing unused connection
To investigate, I went to the 60th page (https://www.yellowpages.ca/search/si/60/pizza/Canada) and noticed that you can not click beyond this page:
My Question: Is there something that I can do differently to try and move past the 60th page, or is there some internal limitation within YellowPages that is preventing from me scraping further?
Thanks!