
I am working with the R programming language.

In a previous question (R: Webscraping Pizza Shops - "read_html" not working?), I learned how to scrape the names and addresses of pizza stores from YellowPages (e.g. https://www.yellowpages.ca/search/si/2/pizza/Canada). Here is the code for scraping a single page:

library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>% 
    read_html()
  
  tibble(
    name = page %>%  
      html_elements(".jsListingName") %>% 
      html_text2(),
    address = page %>% 
      html_elements(".listing__address--full") %>% 
      html_text2()
  )
}
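
For example (assuming the page is still live and the CSS selectors are unchanged), calling the function on a single results page returns a tibble of names and addresses:

# Scrape one results page with the function above
pizza_p2 <- scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")
head(pizza_p2)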

I then tried to make a LOOP that will repeat this for all 391 pages:


a = "https://www.yellowpages.ca/search/si/"

b = "/pizza/Canada"

list_results = list()

for (i in 1:391)

{

url_i = paste0(a,i,b)

s_i = data.frame(scraper(url_i))

ss_i = data.frame(i,s_i)

print(ss_i)
list_results[[i]] <- ss_i


}

final = do.call(rbind.data.frame, list_results)

My Problem: I noticed that after the 60th page, I get the following error:

Error in data.frame(i, s_i) : 
  arguments imply differing number of rows: 1, 0
In addition: Warning message:
In for (i in seq_along(specs)) { :
  closing unused connection 
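
The failure can also be reproduced in isolation (a quick sketch; the URL follows the same pattern as the loop) by calling scraper() directly on a page past 60 and checking the row count:

# Sketch: pages past 60 seem to return no listings,
# so scraper() yields a 0-row tibble and data.frame(i, s_i) fails
s61 <- scraper("https://www.yellowpages.ca/search/si/61/pizza/Canada")
nrow(s61)
# expected: 0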

To investigate, I went to the 60th page (https://www.yellowpages.ca/search/si/60/pizza/Canada) and noticed that you cannot click beyond this page:

[screenshot: the pagination controls stop at page 60]

My Question: Is there something that I can do differently to try and move past the 60th page, or is there some internal limitation within YellowPages that is preventing me from scraping further?

Thanks!

stats_noob

1 Answer


This is a limit on Yellow Pages that prevents you from continuing to the next page. A solution is to assign the return value of scraper() and check its number of rows; if it is 0, break out of the for loop.

a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results <- list()

for (i in 1:391) {
  url_i = paste0(a,i,b)
  
  s <- scraper(url_i)
  message(paste("page number:", i, "\trows:", nrow(s)))
  if(nrow(s) > 0L) {
    s_i <- as.data.frame(s)
    ss_i <- data.frame(i, s_i)
  } else {
    message("empty page, bailing out...")
    break
  }
  list_results[[i]] <- ss_i
}

final <- do.call(rbind.data.frame, list_results)
dim(final)
# [1] 2100    3
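
A variation on the same idea, as a sketch (assuming the scraper() function and the a/b URL parts from the question): let a repeat loop advance the page counter until a page comes back empty, and combine the results with dplyr::bind_rows():

library(dplyr)

list_results <- list()
i <- 1
repeat {
  s <- scraper(paste0(a, i, b))
  if (nrow(s) == 0L) break                    # empty page: no more results
  list_results[[i]] <- mutate(s, page = i)    # tag each row with its page number
  Sys.sleep(1)                                # small pause between requests, to be polite
  i <- i + 1
}

final <- bind_rows(list_results)

This avoids hard-coding 391 iterations: the loop simply stops at the first empty page.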
Rui Barradas
  • @Rui Barradas: Thank you for your answer! Thank you for confirming this! – stats_noob Sep 13 '22 at 02:05
  • Can you please take a look at this question if you have time? https://stackoverflow.com/questions/73696551/r-webscraping-error-arguments-imply-differing-number-of-rows Thank you so much! – stats_noob Sep 13 '22 at 02:05