I am trying to scrape a list of plumbers from http://www.yellowpages.com.au to build a tibble.

Each section of the code (name, phone number, email) works fine on its own, but when I put them together in a function to build the tibble it hits an error because some listings don't have a phone number or an email.

url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"

testscrape <- function(){
  webpage <- read_html(url)
  
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  
  return(tibble(docname = docname, ph_no = ph_no, email = email))
}

Then I run the function:

test_run <- testscrape
test_run()

This produces the following error:

Error: Tibble columns must have compatible sizes.
* Size 36: Existing data.
* Size 17: Column `ph_no`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Browse[1]> 

This leaves R hanging at the `Browse[1]>` prompt.

I appreciate that there are fewer phone numbers than listed plumbers. How do I return `NA` for a plumber with no phone number, so that each number stays aligned with the right plumber?

Thanks in advance.

1 Answer

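You can index each extracted vector up to the length of the longest one. Indexing past the end of a vector returns `NA`, so the shorter columns are padded with `NA` and the tibble columns end up the same size.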
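As a quick illustration of that indexing behaviour, here is a minimal sketch with made-up phone numbers (the values and the length of 4 are assumptions for the demo only). Note that the `NA`s always land at the end of the vector, so this fixes the size mismatch but does not by itself tell you which listing was missing a number.

```r
# Made-up values standing in for the scraped phone numbers
ph_no <- c("02 1111 1111", "02 2222 2222")

# Pretend the longest column has 4 entries
n <- seq_len(4)

# Positions 3 and 4 do not exist in ph_no, so they come back as NA
padded <- ph_no[n]
print(padded)
```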

library(rvest)
library(stringr)
library(tibble)

testscrape <- function(url){
  webpage <- read_html(url)
  
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  
  # Index every vector up to the longest length; positions past the end return NA
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)

# docname ph_no email
#  <lgl>   <lgl> <lgl>
#1   NA      NA    NA   
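If the phone numbers and emails need to stay on the same row as the plumber they belong to, one possible alternative is to select each listing's container first and then call `html_node()` (singular) inside it, since that returns exactly one result per listing and gives `NA` where nothing matches. The sketch below is untested against the real site: the inline HTML, the `.listing` container class and the `mailto:` href format are all assumptions made up for the demo, and only the inner selectors come from the question.

```r
library(rvest)
library(tibble)

# Hypothetical markup standing in for the real page. The ".listing" container
# class and the href format are assumptions, not taken from the actual site.
html <- '
<div class="listing">
  <div class="left"><a class="listing-name">Alpha Plumbing</a></div>
  <a class="contact-phone"><span class="contact-text">(02) 1111 1111</span></a>
  <a class="contact-email" href="mailto:alpha%40example.com?subject=Enquiry">Email</a>
</div>
<div class="listing">
  <div class="left"><a class="listing-name">Beta Plumbing</a></div>
</div>
<div class="listing">
  <div class="left"><a class="listing-name">Gamma Plumbing</a></div>
  <a class="contact-phone"><span class="contact-text">(02) 3333 3333</span></a>
</div>'

page <- read_html(html)

# One node per listing, so every column below has one entry per listing
listings <- html_nodes(page, ".listing")

# html_node() returns a single (possibly missing) match per listing;
# html_text() and html_attr() turn a missing match into NA
docname <- listings %>% html_node(".left .listing-name") %>% html_text()
ph_no   <- listings %>% html_node(".contact-phone .contact-text") %>% html_text()
email   <- listings %>% html_node(".contact-email") %>% html_attr("href")

# Clean the href: drop the scheme, drop the query string, decode the "@"
email <- sub("^mailto:", "", email)
email <- sub("\\?.*$", "", email)
email <- gsub("%40", "@", email, fixed = TRUE)

result <- tibble(docname = docname, ph_no = ph_no, email = email)
print(result)
```

On the real page you would need to inspect the HTML to find whichever element wraps a single listing and use that selector in place of `.listing`.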
Ronak Shah
  • Because I copy-pasted the wrong thing. Sorry. Can you try the updated answer? – Ronak Shah Jan 13 '21 at 08:04
  • Can you provide a couple of links that I can use to test my answer? BTW, the `url` you have shared returns `docname`, `ph_no` and `email` as `NA`. Is that correct? I have also updated the answer with another modification that may help. – Ronak Shah Jan 13 '21 at 08:24
  • Even for this link, `docname`, `ph_no` and `email` are all empty. Are you sure your code works? In your question you mention `The code works fine with each section`, but I don't see that with any of the examples you have shared so far, so I don't know what exactly needs to be debugged. – Ronak Shah Jan 13 '21 at 08:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/227267/discussion-between-foothill-trudger-and-ronak-shah). – Foothill_trudger Jan 13 '21 at 08:54