I am trying to scrape a list of plumbers from http://www.yellowpages.com.au to build a tibble.
The code works fine with each section (name, phone number, email) but when I put it together in a function to build the tibble it hits an error because some don't have phone numbers or emails.
url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"
testscrape <- function(){
webpage <- read_html(url)
docname <- webpage %>%
html_nodes(".left .listing-name") %>%
html_text()
ph_no <- webpage %>%
html_nodes(".contact-phone .contact-text") %>%
html_text()
email <- webpage %>%
html_nodes(".contact-email") %>%
html_attr("href") %>%
as.character() %>%
str_remove_all(".*:") %>%
str_remove_all("\\?(.*)") %>%
str_replace_all("%40","@")
return(tibble(docname = docname, ph_no = ph_no, email = email))
}
Then I run the function:
test_run <- testscrape
test_run()
And the following errors arrive:
Error: Tibble columns must have compatible sizes.
* Size 36: Existing data.
* Size 17: Column `ph_no`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Browse[1]>
Which leaves it hanging.
I appreciate that there are fewer phone numbers than listed plumbers so how do I create a N/A return for that line for that plumber so that the numbers align with the relevant plumbers?
Thanks in advance.