
I am scraping a page on booking.com and building a data frame, and I have noticed that not all hotels have ratings.

I tried this for example:

# Got the elements from Inspect code of the page
titles_page <- page %>% html_elements("div[data-testid='title'][class='fcab3ed991 a23c043802']") %>% html_text()
prices_page <- page %>% html_elements("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- page %>% html_elements("div[aria-label^='Punteggio di']") %>% html_text()

# The variable ratings
tryCatch(expr ={
      ratings_page <- remDr$findElements(using = "xpath", value = "div[aria-label^='Punteggio di']")$getElementAttribute('value')
    },   
    #If the information does not exist in this way it writes NA to the ratings element
    error = function(e){          
      ratings_page <-NA
    })

And it does not change anything.

How can I write NA where the element has no value?

The link

Anisa
  • Tried the solution but nothing happens. Take a look: ```#parse out the parent node for each parent titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_children() #parse out the requested information from each child. prices_page <- titles_page %>% html_elements("span[data-testid='price-and-discounted-price']") %>% html_text() ratings_page <- titles_page %>% html_elements("div[aria-label^='Punteggio di']") %>% html_text()``` – Anisa Apr 18 '23 at 18:35

2 Answers


Maybe something like the following. Untested.

# The variable ratings
ratings_page <- tryCatch(
  expr = {
    # Note: the original value was a CSS selector; `using = "xpath"` needs XPath syntax.
    elem <- remDr$findElements(using = "xpath", value = "//div[starts-with(@aria-label, 'Punteggio di')]")
    # findElements() returns a list of webElements, hence the [[1]]
    elem[[1]]$getElementAttribute('value')
  },   
  # If the information does not exist in this way it writes NA to the ratings element
  error = function(e) NA
)
Rui Barradas
  • This was a superb answer, but when I ran the code, I got a dataframe with all NA values. I guess it misses only something small – Anisa Apr 18 '23 at 18:18
    @Anisa We don't have an url to test the code. Furthermore, you are using RSelenium, not rvest. – Rui Barradas Apr 18 '23 at 19:09
  • No I’m using Rvest, I’ve not loaded Rselenium – Anisa Apr 19 '23 at 09:06
  • @Anisa But `remDr$findElements` and `elem$getElementAttribute` are both RSelenium code. Maybe the error comes from there, try to run the first of these instructions to see what you get. Probably a *"could not find function"* error. If I'm right, then the `expr` part of `tryCatch` will always give an error and the `error` part is always executed, always returning `NA`. – Rui Barradas Apr 19 '23 at 09:09
  • @Anisa Also, can you post the exact url in the question, please, so that we can test the code? – Rui Barradas Apr 19 '23 at 09:12
  • Ofc but it’s too long for comment. How can I send it otherwise? – Anisa Apr 19 '23 at 10:08
    @Anisa You can [edit the question](https://stackoverflow.com/posts/76029792/edit) and post the long url there. You must include a reason for the edit. Then save the edit. – Rui Barradas Apr 19 '23 at 10:56
  • I did it, check it – Anisa Apr 19 '23 at 17:05

Here is a solution based on the strategy from this link: How do you scrape items together so you don't lose the index?.

The key here is using html_element() (without the s). html_element() always returns exactly one result per parent, even if that result is NA. This way, if the element is missing in a parent node, NA fills the gap and the columns stay aligned.

library(rvest)
library(dplyr)

#read the page
url <-"https://www.booking.com/searchresults.it.html?ss=Firenze%2C+Toscana%2C+Italia&efdco=1&label=booking-name-L*Xf2U1sq4*GEkIwcLOALQS267777916051%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-65526620%3Alp9069992%3Ali%3Adec%3Adm%3Appccp&aid=376363&lang=it&sb=1&src_elem=sb&src=index&dest_id=-117543&dest_type=city&ac_position=0&ac_click_type=b&ac_langcode=it&ac_suggestion_list_length=5&search_selected=true&search_pageview_id=2e375b14ad810329&ac_meta=GhAyZTM3NWIxNGFkODEwMzI5IAAoATICaXQ6BGZpcmVAAEoAUAA%3D&checkin=2023-06-11&checkout=2023-06-18&group_adults=2&no_rooms=1&group_children=0&sb_travel_purpose=leisure&fbclid=IwAR1BGskP8uicO9nlm5aW7U1A9eABbSwhMNNeQ0gQ-PNoRkHP859L7u0fIsE"
page <- read_html(url)

#parse out the parent node for each parent 
properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")

#now find the information from each parent
#notice html_element - no "s"
title <- properties %>% html_element("div[data-testid='title']") %>% html_text()
prices <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()    
ratings <- properties %>% html_element(xpath=".//div[@aria-label]") %>% html_text()

data.frame(title, prices, ratings)

                                       title   prices ratings
1                   Sweetly home in Florence US$1.918    <NA>
2                                   Pepi Red US$3.062        
3                 hu Firenze Camping in Town   US$902     8,4
4                              Plus Florence US$1.754     7,9
5                     Artemente Florence B&B US$4.276        
6                                Villa Aruch US$1.658        
7                                Hotel Berna US$2.184        
8                                Hotel Gioia US$2.437        
9                              Hotel Magenta US$3.250        
10                              Villa Neroli US$3.242        
11                       Residenza Florentia US$2.792     8,0
12                Ridolfi Sei Suite Florence US$1.243    <NA>
...
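The html_element()-vs-html_elements() behaviour can be checked on a small inline document (made-up HTML, not from booking.com):

```r
library(rvest)

# Two "cards": only the first has a rating node.
doc <- minimal_html('
  <div class="card"><span class="rating">8,4</span></div>
  <div class="card"></div>
')
cards <- html_elements(doc, "div.card")

# Plural form silently drops the missing node, so the index is lost:
html_elements(cards, "span.rating") %>% html_text()   # length 1

# Singular form returns one result per parent, NA where missing:
html_element(cards, "span.rating") %>% html_text()    # "8,4" NA
```

This is why parsing children from the parent node set with html_element() keeps title, prices, and ratings the same length for data.frame().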
Dave2e
  • Gonna test it right away. – Anisa Apr 19 '23 at 09:07
  • I get this error: `Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class ‘"xml_nodeset"’ to a data.frame` – Anisa Apr 19 '23 at 20:52
  • The above code works on my system with the sample URL. Double check (with `str()` function) to make sure title, prices, ratings, etc. have all been converted from a node into a character string using the `html_text()` function. – Dave2e Apr 19 '23 at 22:35