I am working with the R programming language.

I am trying to scrape the names, addresses and phone numbers of the pizza stores on this website:

Using the answer provided here (R: Webscraping Pizza Shops - "read_html" not working?), I learned how to write the following function to perform this task:

library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>% 
    read_html()
  
  tibble(
    name = page %>%  
      html_elements(".jsListingName") %>% 
      html_text2(),
    address = page %>% 
      html_elements(".listing__address--full") %>% 
      html_text2()
  )
}

scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")

Now, I would like to include the phone number for each of these pizza shops.

Looking at the source code of this website, I see that this information is included within an <h4> tag.

But I am not sure how I can include this specification in the existing webscraping code.

Can someone please show me how to do this?

Thanks!

stats_noob

1 Answer

This is pretty straightforward. The key is to perform the parsing in two steps: first find the parent node for each business, then extract the phone number from each parent.

library(rvest)

page <- read_html("https://www.yellowpages.ca/search/si/2/pizza/Canada")
# get the individual business cards
nodes <- page %>% html_elements("div.listing_right_section")
# find the phone number node within each card
phone <- nodes %>% html_element("ul h4") %>% html_text()

In this case, the phone numbers are within an <h4> tag nested under a <ul> tag.

Update
Incorporating the above code into your code:

# assumes library(tidyverse) (or tibble) and library(rvest) are loaded, as in the question
scraper <- function(url) {
   # read_html() takes the URL as its first argument
   page <- read_html(url)
   
   # find the business record nodes
   businesses <- page %>% html_elements("div.listing_right_section")
   
   # now extract the requested information from each node
   tibble(
      name = businesses %>% html_element(".jsListingName") %>% html_text2(),
      address = businesses %>% html_element(".listing__address--full") %>% html_text2(),
      phone = businesses %>% html_element("ul h4") %>% html_text()
   )
}

name_address = scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")

The key to successful scraping is to use html_elements() to grab all of the parent nodes, each of which contains a full record. Then use html_element() (without the s) to grab the wanted field (child node) out of each parent node.
html_element() returns exactly one result per parent node, with NA where the field is missing, so every column has the same length. Calling html_elements() on the whole page instead returns only the matches that exist, so the vectors end up with different lengths whenever some information is missing.
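
To see the difference without hitting the live site, here is a small sketch on made-up HTML. The markup, shop names and phone number are hypothetical and only mimic the class names used above; the second listing deliberately has no phone number.

```r
library(rvest)

# Hypothetical markup: two listings, the second one has no phone number
html <- minimal_html('
  <div class="listing_right_section">
    <a class="jsListingName">Pizza One</a>
    <ul><li><h4>555-0101</h4></li></ul>
  </div>
  <div class="listing_right_section">
    <a class="jsListingName">Pizza Two</a>
  </div>
')

# Page-wide search: only the matches that exist are returned
names_all  <- html %>% html_elements(".jsListingName") %>% html_text()
phones_all <- html %>% html_elements("ul h4") %>% html_text()
length(names_all)   # 2
length(phones_all)  # 1, so the two vectors no longer line up

# Per-listing search: one result per parent node, NA where the field is missing
cards  <- html %>% html_elements("div.listing_right_section")
phones <- cards %>% html_element("ul h4") %>% html_text()
phones              # "555-0101" NA
```

Because the per-listing vectors all have one entry per card, they can be combined into a tibble directly, and the missing phone number simply shows up as NA in that row.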

Dave2e
Thank you so much for your answer! I tried to incorporate your code into the existing code (see answer below). Have I done this correctly? Thank you so much! – stats_noob Nov 15 '22 at 01:11