
I am working with the R programming language.

I am trying to scrape the name and address of the pizza stores on this website: https://www.yellowpages.ca/search/si/2/pizza/Canada (the results are paginated, e.g. https://www.yellowpages.ca/search/si/3/pizza/Canada, https://www.yellowpages.ca/search/si/4/pizza/Canada, etc.)

I am trying to follow the answer provided here: Scraping Yellowpages in R

library(rvest)
library(stringr)

url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"


testscrape <- function(url){
  webpage <- read_html(url)
  
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  
  # pad each column to the same length before building the tibble
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)

But this code is taking a very long time to run. I tried to investigate by running individual parts of the function, and I think I found the problem: the `read_html()` call itself is not completing. I tried to replace it with another statement:

 library(httr)
 webpage <- GET(url)

This works, but now the response is not in the same format, so the rest of the code no longer works on it.
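(For reference, one way to turn an httr response back into the kind of parsed document that `read_html()` returns is sketched below; I am not sure it fixes the slowness itself. The `user_agent()` header is an assumption that sometimes helps with sites that respond slowly to default R requests.)

```r
library(httr)
library(rvest)

url <- "https://www.yellowpages.ca/search/si/2/pizza/Canada"

# Fetch with httr, then hand the body text to rvest for parsing.
resp <- GET(url, user_agent("Mozilla/5.0"))
webpage <- read_html(content(resp, as = "text", encoding = "UTF-8"))
# webpage can now be piped into html_nodes()/html_text() as before
```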

Can someone please show me how to do this?

In the end, I would like the output to look something like this:

  id                 name                                       address
1  1   OJ's Steak & Pizza 9906B Franklin Ave, Fort McMurray, AB T9H 2K5
2  2    MJs Pizza & Grill 10012 Franklin Ave, Fort McMurray, AB T9H 2K6
3  3 Hu's Pizza & Donairs 10020 Franklin Ave, Fort McMurray, AB T9H 2K6

# sample results

sample_results = structure(list(id = c(1, 2, 3), name = c("OJ's Steak & Pizza", 
"MJs Pizza & Grill", "Hu's Pizza & Donairs"), address = c("9906B Franklin Ave, Fort McMurray, AB T9H 2K5", 
"10012 Franklin Ave, Fort McMurray, AB T9H 2K6", "10020 Franklin Ave, Fort McMurray, AB T9H 2K6"
)), class = "data.frame", row.names = c(NA, -3L))

Thanks!

stats_noob

1 Answer


Fast, but not robust. (If either the name or the address is missing for a listing, the two columns will have different lengths and the code will break, I think.)

library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>% 
    read_html()
  
  tibble(
    name = page %>%  
      html_elements(".jsListingName") %>% 
      html_text2(),
    address = page %>% 
      html_elements(".listing__address--full") %>% 
      html_text2()
  )
}

scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")

# A tibble: 35 x 2
   name                                  address                                 
   <chr>                                 <chr>                                   
 1 OJ's Steak & Pizza                    9906B Franklin Ave, Fort McMurray, AB T~
 2 MJs Pizza & Grill                     10012 Franklin Ave, Fort McMurray, AB T~
 3 Hu's Pizza & Donairs                  10020 Franklin Ave, Fort McMurray, AB T~
 4 Eagle Ridge Convenience Store & Pizza 117-375 Loutit Rd, Fort McMurray, AB T9~
 5 Cosmos Pizza                          9713 Hardin St, Fort McMurray, AB T9H 1~
 6 Boston Pizza                          10202 MacDonald Ave, Fort McMurray, AB ~
 7 Jomaa's Pizza & Chicken               Beacon Hill Shpg Plaza, Fort McMurray, ~
 8 Abasand PK's Pizza                    101-307 Athabasca Ave, Fort McMurray, A~
 9 Pizza 73                              1-289 Powder Dr, Ft McMurray, AB T9K 0M5
10 Boston Pizza                          110 Millennium Dr, Fort McMurray, AB T9~
# ... with 25 more rows
# i Use `print(n = ...)` to see more rows
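To extend this to several pages, one sketch, assuming the page URLs simply increment the `si` number and the same selectors apply throughout. The `tryCatch()` wrapper is an addition so that one failing page (timeout, missing page) does not abort the whole run:

```r
library(tidyverse)
library(rvest)

scrape_page <- function(i) {
  url <- paste0("https://www.yellowpages.ca/search/si/", i, "/pizza/Canada")
  tryCatch({
    page <- read_html(url)
    tibble(
      name    = page %>% html_elements(".jsListingName") %>% html_text2(),
      address = page %>% html_elements(".listing__address--full") %>% html_text2()
    )
  },
  # On any error, return an empty tibble so the remaining pages still bind.
  error = function(e) tibble(name = character(), address = character()))
}

results <- map_dfr(2:5, scrape_page) %>%
  mutate(id = row_number(), .before = name)
```

Note the caveat above still applies: if a listing card is missing one of the two fields, the columns inside `tibble()` will have different lengths and that page will error (and here be silently skipped); extracting name and address per listing card would be the more robust fix.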
Chamkrai
  • @ Tom Hoel: Thank you so much for your answer! Everything works perfectly! Do you have any ideas to why my "webpage %>% html_text()" was taking so long to run? Thank you much! – stats_noob Sep 11 '22 at 16:40
  • For example, using the code you provided - if I only try : "page <- url %>% read_html()" ... this line of code seems to be taking a long time to run.... yet the whole code runs almost instantly. I wonder how is this possible? – stats_noob Sep 11 '22 at 16:42
  • @stats_noob I have no idea, sorry. If `read_html()` is slow, it is probably internet issues or slow internet. The code is quite fast. – Chamkrai Sep 11 '22 at 16:46
  • How many seconds are we talking? – Chamkrai Sep 11 '22 at 16:46
  • So far the individual statement won't run at all... yet for some reason, your entire code runs perfectly. – stats_noob Sep 11 '22 at 16:51
  • @ Tom Hoel: You mentioned that the code might break, I wonder if it might be possible to avoid this by incorporating the following statement : {tryCatch({ #insert code here# }, error = function(e){}) } – stats_noob Sep 11 '22 at 16:53
  • @stats_noob Are you trying to scrape multiple pages? – Chamkrai Sep 11 '22 at 16:54
  • https://stackoverflow.com/questions/66457617/trycatch-in-r-programming – stats_noob Sep 11 '22 at 16:54
  • @stats_noob I am familiar with tryCatch, but I scraped 1:20 pages now and it threw no error. The code is fine! – Chamkrai Sep 11 '22 at 16:57
  • @ Tom Hoel: Hi Tom, I hope you are doing well! Recently, I have been trying to modify the answer your provided in this post and apply it to another question: https://stackoverflow.com/questions/74438814/r-webscraping-pizza-shops-adding-phone-numbers - if you have time can you please take a look at this question? Thank you so much for all your help! – stats_noob Nov 14 '22 at 22:59