I am working with the R programming language.
I am trying to scrape the names and addresses of the pizza stores listed on this website, across all of its paginated results (e.g. https://www.yellowpages.ca/search/si/2/pizza/Canada, https://www.yellowpages.ca/search/si/3/pizza/Canada, https://www.yellowpages.ca/search/si/4/pizza/Canada, etc.)
I am trying to follow the answer provided here: Scraping Yellowpages in R
library(rvest)
library(stringr)
url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"
library(tibble)
testscrape <- function(url){
  # Download and parse the page
  webpage <- read_html(url)
  # Business names
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  # Phone numbers
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  # Email addresses, cleaned up from the mailto: links
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  # Pad the shorter vectors with NA so all columns have the same length
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)
But this code takes a very long time to run. I investigated by running the individual parts of the function, and I think I found the problem: the read_html() call itself appears to hang. I tried replacing it with another statement:
library(httr)
webpage <- GET(url)
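This returns a result, but it is an httr response object rather than the parsed HTML document that read_html() produces, so the rest of the function cannot use it directly.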
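For reference, here is a minimal sketch of how an httr response can be handed back to rvest so that html_nodes() applies again. The helper name fetch_page, the browser-like user agent string, and the 30-second timeout are all hypothetical choices for illustration; whether a user agent makes any difference on this site is an assumption that has not been verified.

```r
library(httr)
library(rvest)

# Hypothetical helper: download a page with httr and return a parsed document.
# The user agent and timeout values are assumptions, not known requirements.
fetch_page <- function(url, timeout_sec = 30) {
  resp <- GET(url, user_agent("Mozilla/5.0"), timeout(timeout_sec))
  # Raise an R error on HTTP 4xx/5xx instead of parsing an error page
  stop_for_status(resp)
  # Extract the body as text and parse it into an xml_document
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access, so left commented out):
# webpage <- fetch_page("https://www.yellowpages.ca/search/si/2/pizza/Canada")
# webpage %>% html_nodes("a") %>% html_text()
```

The explicit timeout means a stalled request fails with an error rather than blocking the session indefinitely.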
Can someone please show me how to scrape the name and address of each listing from these pages?
In the end, I would like the output to look something like this:
  id                 name                                       address
1  1   OJ's Steak & Pizza  9906B Franklin Ave, Fort McMurray, AB T9H 2K5
2  2    MJs Pizza & Grill 10012 Franklin Ave, Fort McMurray, AB T9H 2K6
3  3 Hu's Pizza & Donairs 10020 Franklin Ave, Fort McMurray, AB T9H 2K6
# sample results
sample_results = structure(list(id = c(1, 2, 3), name = c("OJ's Steak & Pizza",
"MJs Pizza & Grill", "Hu's Pizza & Donairs"), address = c("9906B Franklin Ave, Fort McMurray, AB T9H 2K5",
"10012 Franklin Ave, Fort McMurray, AB T9H 2K6", "10020 Franklin Ave, Fort McMurray, AB T9H 2K6"
)), class = "data.frame", row.names = c(NA, -3L))
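To make the target shape concrete, here is a sketch of turning one parsed page into that id/name/address layout. The helper name parse_listings and the CSS selectors (.listing, .listing__name--link, .listing__address--full) are assumptions and may not match the live site; the inline HTML is made up so the example runs offline.

```r
library(rvest)

# Hypothetical helper: extract one row per listing from a parsed page.
# All three selectors are assumptions and need checking against the real page.
parse_listings <- function(doc) {
  cards <- html_nodes(doc, ".listing")
  # html_node() returns one match per card (NA when missing),
  # so names and addresses stay aligned row by row
  res <- data.frame(
    name = html_text(html_node(cards, ".listing__name--link"), trim = TRUE),
    address = html_text(html_node(cards, ".listing__address--full"), trim = TRUE),
    stringsAsFactors = FALSE
  )
  cbind(id = seq_len(nrow(res)), res)
}

# Made-up HTML standing in for a real results page
doc <- read_html(paste0(
  '<div class="listing">',
  '<a class="listing__name--link">MJs Pizza &amp; Grill</a>',
  '<span class="listing__address--full">10012 Franklin Ave, Fort McMurray, AB T9H 2K6</span>',
  '</div>',
  '<div class="listing">',
  '<a class="listing__name--link">Example Pizza</a>',
  '</div>'
))
results <- parse_listings(doc)
```

Extracting per listing card, rather than collecting all names and all addresses separately, keeps a listing with no address from shifting every later address up by one row.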
Thanks!