I am using R in RStudio and I have an R script that performs web scraping. I am stuck on an error message when running these specific lines:

      review<-ta1 %>%
              html_node("body") %>%
              xml_find_all("//div[contains@class,'location-review-review']")

The error message is as follows:

xmlXPathEval: evaluation failed
Error in `*tmp*` - review : non-numeric argument to binary operator
In addition: Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
  Invalid predicate [1206]

Note: I have dplyr and rvest libraries loaded in my R script.

I had a look at the answers to the following question on StackOverflow: Non-numeric Argument to Binary Operator Error

I have a feeling my solution relates to the answer provided by Richard Border to the question linked above. However, I am having a hard time trying to figure out how to correct my R syntax based on that answer.

Thank you for looking into my question.

Sample of ta1 added:

{xml_document}
<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
[1] <head>\n<meta http-equiv="content-type" content="text/html; charset=utf-8">\n<link rel="icon" id="favicon"  ...
[2] <body class="rebrand_2017 desktop_web Hotel_Review  js_logging" id="BODY_BLOCK_JQUERY_REFLOW" data-tab="TAB ...
user3115933

1 Answer

I'm going to make a few assumptions here, since your post doesn't contain enough information to generate a reproducible example.

First, I'm guessing that you are trying to scrape TripAdvisor, since the id and class fields match for that site and since your variable is called ta1.

Second, I'm assuming that you are trying to get the text of each review and the number of stars for each, since that is the relevant scrapable information in the classes you appear to be looking for.

I'll need to start by getting my own version of the ta1 variable, since that wasn't reproducible from your edited version.

library(httr)
library(rvest)
library(xml2)
library(magrittr)
library(tibble)

"https://www.tripadvisor.co.uk/"                          %>% 
paste0("Hotel_Review-g186534-d192422-Reviews-")           %>%
paste0("Glasgow_Marriott_Hotel-Glasgow_Scotland.html") -> url

ta1 <- url %>% GET %>% read_html

Now we write the correct XPath expressions for the data of interest. Note that `contains` is an XPath function, so its arguments go inside parentheses:

# xpath for elements whose text contains reviews
xpath1 <- "//div[contains(@class, 'location-review-review-list-parts-Expand')]"

# xpath for the elements whose class indicates the rating
xpath2 <- "//div[contains(@class, 'location-review-review-')]"
xpath3 <- "/span[contains(@class, 'ui_bubble_rating bubble_')]"
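
For reference, the `Invalid predicate` warning in the question is consistent with the original expression, `contains@class,'location-review-review'`, lacking the parentheses that a call to the XPath `contains()` function needs. Here is a minimal sketch of the corrected form, run on a made-up HTML snippet (the markup below is invented for illustration and is not TripAdvisor's actual page):

```r
library(xml2)

# Hypothetical HTML, invented for illustration only
doc <- read_html('<html><body>
  <div class="location-review-review-list-parts-Expand">Great stay</div>
  <div class="other">Not a review</div>
</body></html>')

# contains() is an XPath function: attribute and substring go in parentheses
good_xpath <- "//div[contains(@class, 'location-review-review')]"
hits <- xml_find_all(doc, good_xpath)

# Text of every div whose class attribute contains the substring
xml_text(hits)
```

Only the first `div` should match, since the second one's class does not contain the substring.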

We can get the text of the reviews like this:

ta1                                             %>% 
xml_find_all(xpath1)                            %>% # run first query
html_text()                                     %>% # extract text
extract(!equals(., "Read more")) -> reviews         # remove "blank" reviews
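
The filtering step above relies on magrittr's aliases, where `extract()` stands for `[` and `equals()` stands for `==`. As a small sketch on an invented character vector (the `texts` values are hypothetical, not real scraped output), the same filter can be written either way:

```r
library(magrittr)

# Hypothetical scraped text, invented for illustration
texts <- c("Lovely hotel", "Read more", "Friendly staff", "Read more")

# Base R subsetting: drop every element that is exactly "Read more"
reviews_base <- texts[texts != "Read more"]

# The piped form used above, with magrittr's aliases for `[` and `==`
reviews_pipe <- texts %>% extract(!equals(., "Read more"))

# Both approaches should keep the same two elements
identical(reviews_base, reviews_pipe)
```

Which form to use is mostly a matter of taste; the base R version avoids a possible name clash if another loaded package also exports `extract()`.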

And the associated star ratings like this:

ta1 %>% 
xml_find_all(paste0(xpath2, xpath3)) %>% 
xml_attr("class")                    %>% 
strsplit("_")                        %>%
lapply(function(x) x[length(x)])     %>% 
as.numeric                           %>% 
divide_by(10)                         -> stars
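
To make the class-parsing step above easier to follow, here is a sketch of the same idea in base R on invented class strings. The values in `classes` are hypothetical and only assume the `bubble_NN` pattern that the XPath targets:

```r
# Hypothetical class attribute values, invented for illustration
classes <- c("ui_bubble_rating bubble_50", "ui_bubble_rating bubble_35")

# Greedy match removes everything up to the last underscore, leaving the digits
digits <- sub(".*_", "", classes)

# Convert to numeric and rescale, e.g. "50" becomes 5
stars_demo <- as.numeric(digits) / 10
stars_demo
```

This assumes every matched class ends in a numeric suffix; a class without one would produce `NA` with a coercion warning.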

And our result looks like this:

tibble(rating = stars, review = reviews)
## A tibble: 5 x 2
#  rating review                                                                                             
#   <dbl> <chr>                                                                                              
#1      1 7 of us attended the Christmas Party on Satu~
#2      4 "We stayed 2 nights over last weekend to att~
#3      3 Had a good stay, but had no provision to kee~
#4      3 Booked an overnight for a Christmas shopping~
#5      4 Attended a charity lunch here on Friday and ~
Allan Cameron
  • You are a genius! Exactly what I am doing. You made my day! Thanks a lot! – user3115933 Dec 20 '19 at 15:39
  • I tried to convert that tibble into a data frame as follows: `tibble1 <- tibble(rating = stars, review = reviews)` followed by `df1 <- as.data.frame(tibble1)`. However, df1 is just showing a table with numbers. Any idea on how to solve this? – user3115933 Dec 21 '19 at 04:41
  • @user3115933 Are you sure this isn't just because of how data frames are displayed in the console? What happens if you try `df1$review` or `df1$review[1]`? – Allan Cameron Dec 23 '19 at 15:27