
I'm trying to scrape data from TripAdvisor search results that span several pages using rvest.

Here's my code:

library(rvest)

starturl <- 'https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=0'

swimwith <- read_html(starturl)

swdf <- swimwith %>%
  html_nodes('.title span') %>%
  html_text()

It works fine for the first page of results, but I can't figure out how to get results from the subsequent pages. I noticed that the end of the url denotes the start position of the results, so I changed it from '0' to '30' as follows:

url <- sub('A&o=0', paste0('A&o=', '30'), starturl)

webpage <- html_session(url)
swimwith <- read_html(webpage)

swdf2 <- swimwith %>%
  html_nodes('.title span') %>%
  html_text()

However, the results for swdf2 are the same as swdf even though the url loads the second page of results in a web browser.

Any idea how I can get the results from these subsequent pages?

Adamaki
  • I think you won't get past using Selenium (and even then I am not sure it will work). – Martin Schmelzer Nov 28 '17 at 17:31
  • What happens if you start a fresh session of R, and try with the URL: `starturl <- 'https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=30'`, do you get the second page? – Mako212 Nov 28 '17 at 17:54
  • I should also add that rvest has functions to navigate a website follow_link() and jump_to() but they don’t work here because the links are JavaScript buttons. – Adamaki Nov 29 '17 at 08:14
  • @Mako212 I just tried starting a new session with the second page link but it still gets results from the first page. – Adamaki Dec 04 '17 at 10:57
  • @MartinSchmelzer are you saying I need to use Selenium? – Adamaki Dec 04 '17 at 10:58
  • Yes! The Javascript nature of the page's navigation is the problem. I recently coded some scrapers using Selenium as well. I had the feeling that it is still a little more stable to use Python + Selenium than RSelenium. – Martin Schmelzer Dec 04 '17 at 12:18
  • @MartinSchmelzer thanks. I'm currently looking into RSelenium but can't get the selenium server running yet... – Adamaki Dec 04 '17 at 13:58

2 Answers


I think you want something like this.

jump <- seq(0, 300, by = 30)
site <- paste0('https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=', jump)

dfList <- lapply(site, function(i) {
  swimwith <- read_html(i)
  swimwith %>%
    html_nodes('.title span') %>%
    html_text()
})

finaldf <- do.call(rbind, dfList)

It doesn't work in my office because the firewall is blocking it, but I think that should work for you.
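A self-contained variant of the same loop, with a delay between requests and error handling added (the `Sys.sleep()` and `tryCatch()` wrappers are my additions, not part of the original answer; `unlist()` is used instead of `rbind()` because each page yields a plain character vector of titles):

```r
library(rvest)

# Sketch: same paging idea as above, hardened for a long run.
jump <- seq(0, 300, by = 30)
base <- 'https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o='
site <- paste0(base, jump)

titles <- unlist(lapply(site, function(u) {
  Sys.sleep(1)  # be polite to the server between requests
  tryCatch(
    read_html(u) %>% html_nodes('.title span') %>% html_text(),
    error = function(e) character(0)  # skip pages that fail to load
  )
}))
```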

Also, take a look at the links below.

https://rpubs.com/ryanthomas/webscraping-with-rvest

loop across multiple urls in r with rvest

ASH

Approach 1) Here is an approach based on the R package RSelenium:

library(RSelenium)

# Note : You have to install chromedriver 
rd <- rsDriver(chromever = "96.0.4664.45", browser = "chrome", port = 4450L) 
remDr <- rd$client
remDr$open()
remDr$navigate("https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=0")

remDr$screenshot(display = TRUE, useViewer = TRUE) 

list_Text <- list()

for (i in 1:30) {
  print(i)
  web_Obj <- remDr$findElement("xpath", paste0("//*[@id='BODY_BLOCK_JQUERY_REFLOW']/div[2]/div/div[2]/div/div/div/div/div[1]/div/div[1]/div/div[3]/div/div[1]/div/div[2]/div/div/div[", i, "]"))
  list_Text[[i]] <- web_Obj$getElementText()
}

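As a less brittle alternative to the long positional XPath above, RSelenium can also locate the same nodes with the CSS selector from the question (a sketch; the fixed `Sys.sleep()` is a crude stand-in for a proper wait on the JavaScript-rendered results):

```r
# Sketch: reuse the '.title span' selector instead of a positional XPath.
Sys.sleep(5)  # crude wait for the results to finish rendering
elems <- remDr$findElements(using = "css selector", value = ".title span")
titles <- unlist(lapply(elems, function(e) e$getElementText()))
```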

Approach 2) If you are looking to extract the titles only, you can print the webpage to PDF and extract the text from the PDF afterwards. Here is an example:

library(pagedown)
library(pdftools)
chrome_print("https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=0",
             "C:\\...\\trip_advisor.pdf")

text <- pdf_text("C:\\...\\trip_advisor.pdf")
text  <- strsplit(text, split = "\r\n")

# The titles are in the variable text ...
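After the `strsplit()` call above, `text` is a list of character vectors, one per PDF page. A rough way to drop the blank lines and inspect candidate title lines (a sketch; the exact filtering depends on the PDF layout, and on non-Windows platforms the split string may need to be "\n" instead of "\r\n"):

```r
lines <- unlist(text)
lines <- trimws(lines)
lines <- lines[nchar(lines) > 0]  # drop blank lines
head(lines, 20)                   # inspect the first candidate title lines
```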
Emmanuel Hamel