Scraping TripAdvisor text using R and rvest

Question

Here is the link https://www.tripadvisor.com/AttractionProductReview-g60750-d12086300-San_Diego_Whale_Watching_Cruise-San_Diego_California.html

I want to get the text from the "What to expect" section. I tried several selectors, but none of them returned it.

library(rvest)

link <- "https://www.tripadvisor.com/AttractionProductReview-g60750-d12086300-San_Diego_Whale_Watching_Cruise-San_Diego_California.html"
webpage <- read_html(link)

webpage %>% html_node( '#\\:lithium-RmpiitkqlsnklaH1\\: .KxBGd' ) %>% html_text(trim = TRUE)
webpage %>% html_nodes('[data-has-vuc|="true"]') %>% html_text(trim = TRUE)
webpage  %>% html_nodes("span.biGQs._P.pZUbB.KxBGd") %>% html_text(trim = TRUE)

Any suggestions?

It may be the case that the data you're after is generated from JavaScript and so is not available to rvest. [RSelenium may be the way to go](https://stackoverflow.com/questions/63568086/rvest-and-sites-with-javascript). — Stuart Allen, Jun 15 '23 at 04:38

margusl · Accepted Answer · 2023-06-19T11:10:31.797

This uses chromote to render the page and to evaluate JavaScript that extracts the element(s) of interest. It is probably not the most robust solution and likely needs some tweaking, but it should illustrate how one could approach such problems. The same JavaScript-driven strategy should work with (R)Selenium too.

library(chromote)
library(rvest)
b <- ChromoteSession$new()
{
  b$Page$navigate("https://www.tripadvisor.com/AttractionProductReview-g60750-d12086300-San_Diego_Whale_Watching_Cruise-San_Diego_California.html")
  b$Page$loadEventFired()
  Sys.sleep(2)
}

get_section <- function(c_session, section_text){
  # find element with javascript:
  # use XPath to find (the first) <span></span> element that includes the section_text parameter,
  # find the closest <dt> among its parents, then get that <dt>'s next sibling element, the <dd> with the text content
  js_str <- paste0(
  'var xpath = \'//span[contains(., "',section_text,'")]\';
   document.evaluate(
     xpath, 
     document, null, XPathResult.UNORDERED_NODE_ITERATOR_TYPE, null )
   .iterateNext()
   .closest("dt")
   .nextElementSibling
   .innerHTML')
  c_session$Runtime$evaluate(js_str)$result$value %>% 
    read_html()
}
# simple section:
get_section(b, "What to expect") %>% html_text()
#> [1] "Travel back in time when sailing on the America, a replica of the sailing ship that won the first America's Cup sailing competition in 1851. Your classic sailing vessel provides a smooth ride and spacious decks, perfect for sailing on the Pacific Ocean in search of gray whales and other marine life. Since your boat principally moves under wind power without using the engine, your captain can get closer to the marine animals without scaring them. The boat's deep keel provides excellent stability and large decks offer unobstructed views, making the America a prime vessel for whale-watching.\n\nSnacks and drinks (non-alcoholic) are offered during the cruise. You are welcome to bring along a picnic lunch or your favorite bottle of wine to enjoy onboard.\n\nWhale sightings are guaranteed on your cruise. If no whales are sighted, you can return for a complimentary whale watching cruise on another day in the same season. The America also provides a 'No Seasickness' guarantee."

# structured section
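# escape single quotes with backslashes: the R literal below evaluates to What\'s included,
# so the quote stays escaped inside the single-quoted JavaScript string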
incl_section <- get_section(b, "What\\\'s included")
# list items
incl_section %>% html_elements("div:nth-child(1) li") %>% html_text()
#>  [1] "4.5-hr sailing cruise"                                                            
#>  [2] "Sodas and snacks"                                                                 
#>  [3] "Whale sighting guarantee: If you don't see a whale, you get to come back for free"
#>  [4] "No seasickness' guarantee: Lose your lunch, we get you a new one"                 
#>  [5] "9:00 AM trip: Check in between 8:00 AM - 8:30 AM "                                
#>  [6] "11:00 AM trip: Check-in starts at 10:00 AM and ends at 10:30 AM"                  
#>  [7] "1:15 PM trip: Check in between 12:30 - 12:45"                                     
#>  [8] "2:00 PM Trip: Check in between 1:15 PM - 1:30 PM"                                 
#>  [9] "Free Parking. Check-in located next to parking lot."                              
#> [10] "Bring layers, blankets, and jackets. It can be cold on the water!"                
#> [11] "Gratuities"                                                                       
#> [12] "Hotel pickup and drop off"

# what's not included
not_included <- incl_section %>% html_elements("div:nth-child(2)")
# header
not_included %>% html_element("div") %>% html_text()
#> [1] "What's not included"
# list items
not_included %>% html_elements("li") %>% html_text()
#> [1] "Gratuities"                "Hotel pickup and drop off"

Created on 2023-06-19 with reprex v2.0.2

When I tried "What's included" it did not work, so I used "//span[contains(., \'What\'s included\')]" instead. — Ben, Jun 18 '23 at 17:48
@Ben, getting quotes and escape sequences right is a bit tricky. I refactored the answer a bit and added an example for the `"What's included"` section. — margusl, Jun 19 '23 at 11:13
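
Since the escaping discussed in these comments is easy to get wrong by hand, here is a minimal sketch of doing it programmatically. The helpers js_escape() and build_xpath_js() are hypothetical names, not part of the answer. They rebuild only the `var xpath = ...` line from get_section(), and they assume section_text contains no double quote, because the XPath wraps the text in double quotes.

```r
# Hypothetical helper (not in the original answer): escape a string so it can
# sit inside a single-quoted JavaScript string literal.
js_escape <- function(x) {
  x <- gsub("\\", "\\\\", x, fixed = TRUE)  # one backslash -> two backslashes
  gsub("'", "\\'", x, fixed = TRUE)         # '  ->  \'
}

# Hypothetical helper: build the "var xpath = ..." statement used by
# get_section(). The XPath quotes the text with double quotes, so a double
# quote inside section_text is rejected rather than escaped.
build_xpath_js <- function(section_text) {
  stopifnot(!grepl('"', section_text, fixed = TRUE))
  paste0("var xpath = '//span[contains(., \"",
         js_escape(section_text),
         "\")]';")
}

# Print the generated JavaScript to inspect the escaping
cat(build_xpath_js("What's included"), "\n")
```

With a helper like this, a caller could pass the plain text "What's included" and leave the backslashes to the function, instead of writing "What\\\'s included" by hand.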
