I'm trying to extract the dates from TripAdvisor reviews.

I started with:

https://www.tripadvisor.es/Hotel_Review-g562819-d237083-Reviews-or150-Hotel_Riu_Don_Miguel-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html#REVIEWS

The dates come in two formats: an absolute format (day, month name, year), e.g. "Opinión escrita el 21 mayo 2010" ("Review written on 21 May 2010"), and a relative format, e.g. "Opinión escrita hace 4 días" ("Review written 4 days ago").

The 'normal format' has a single class, ratingDate:

<span class="ratingDate">Opinión escrita el 25 octubre 2006</span>

The 'relative format' has two classes, ratingDate and relativeDate, and carries the actual date in its title attribute:

<span title="6 marzo 2016" class="ratingDate relativeDate">Opinión escrita hace 4 días</span>

I'm using R and the rvest package to scrape the dates.

library(rvest)

url_hotel <- "https://www.tripadvisor.es/Hotel_Review-g562819-d237083-Reviews-or150-Hotel_Riu_Don_Miguel-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html#REVIEWS"
html_hotel <- url_hotel %>% read_html()

And here is my problem. When I try to scrape the dates with this code

dates <- html_hotel %>% html_nodes(".ratingDate")

I only get the 'normal' dates, not the relative ones.

While trying to find a solution I came across this XPath, but

dates <- html_hotel %>% html_nodes(xpath="//*[contains(concat(' ', normalize-space(@class), ' '), ' ratingDate ')]")

didn't work. I keep getting the same results.

Someone else was trying to get the same data from TripAdvisor using Python, but the answer given there didn't work for me either:

dates <- html_hotel %>% html_nodes(xpath='//div[@class="col2of2"]//span[@class="ratingDate relativeDate"/@title or @class="ratingDate"]')   

Is there any way, with a better XPath or otherwise, to get the 'relative dates'?

Thanks in advance.

  • I'm guessing you need other packages: `dates <- html_nodes(".ratingDate") Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"`. And ... for R DD/MM/YYYY is NOT a "normal format" under the assumption that you expect "normal" to be default. – IRTFM Mar 11 '16 at 02:15
  • @42 Thanks for your comment. You got an error because the code was wrong. Sorry. It was my fault. It is already corrected. "normal format" is only a name to refer to a date that is not a "relative date". I can get this "normal date" and work with it in R. The problem is that I can't get the "relative date". A bad Xpath I suppose. – Christian Gonzalez-Martel Mar 11 '16 at 04:23

1 Answer

This is a guess. Since we do not share locales, the month names in your dates are spelled differently from mine and the code returns NAs in my locale, but try this (based on what I suspect is the correct date format for your locale):

 dates %>%
   html_attr("title") %>%
   strptime("%d %B %Y") %>%
   as.POSIXct()

Taken from https://github.com/hadley/rvest/blob/master/demo/tripadvisor.R
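
Because `strptime("%d %B %Y")` depends on the session locale, a locale-independent variant may be easier to reproduce. The following is only a minimal sketch: it runs on the two spans quoted in the question rather than on the live page, `parse_spanish_date` is a hypothetical helper (not part of rvest), and it assumes the month names are always lowercase Spanish as in those examples. It takes the `title` attribute when present and falls back to the node text otherwise.

```r
library(rvest)

# Two spans copied from the question, wrapped in a minimal document.
doc <- read_html('<div>
  <span class="ratingDate">Opinión escrita el 25 octubre 2006</span>
  <span title="6 marzo 2016" class="ratingDate relativeDate">Opinión escrita hace 4 días</span>
</div>', encoding = "UTF-8")

nodes <- html_nodes(doc, ".ratingDate")

# Relative dates carry the real date in `title`; the others only have text.
title <- html_attr(nodes, "title")
raw   <- ifelse(is.na(title), html_text(nodes), title)

# Hypothetical helper: parse "day month-name year" without using the locale.
parse_spanish_date <- function(x) {
  months_es <- c("enero", "febrero", "marzo", "abril", "mayo", "junio",
                 "julio", "agosto", "septiembre", "octubre", "noviembre",
                 "diciembre")
  # Capture day, month name and year; the pattern is ASCII only.
  parts <- regmatches(x, regexec("([0-9]{1,2}) ([a-z]+) ([0-9]{4})", x))
  iso <- vapply(parts, function(p) {
    if (length(p) < 4) return(NA_character_)  # no date found in this string
    sprintf("%s-%02d-%02d", p[4], match(p[3], months_es), as.integer(p[2]))
  }, character(1))
  # An explicit format makes unparseable strings come back as NA.
  as.Date(iso, format = "%Y-%m-%d")
}

dates <- parse_spanish_date(raw)
```

On the real page the same steps would apply to `html_hotel %>% html_nodes(".ratingDate")`, provided the relative-date spans are actually present in the downloaded source.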

IRTFM
  • Thanks @42. Perhaps I have not been clear in my explanation. The page has 10 reviews. With `dates <- html_hotel %>% html_nodes(".ratingDate")` I get a list of 8 elements with the first eight dates, and I can convert them to a POSIXct object with `date %>% gsub("Opinión escrita el |\n","",.) %>% dmy()`. But, and here is the problem, I can't get the last two dates. They have `class="ratingDate relativeDate"` and I can't scrape them with `dates <- html_hotel %>% html_nodes(".ratingDate")`. So I would like to know what argument to pass to html_nodes to scrape these dates. Thanks – Christian Gonzalez-Martel Mar 11 '16 at 14:01
  • When I look at the page delivered to Chrome with the "View Source" facility, there are only 8 items with the class "ratingDate" (and only 8 opinion sections on the visible page have dates). Two opinions instead say "Opinión escrita hace 5 días". There are other dates but not with that class. I get an additional 9 nodes using `dates2 <- html_hotel %>% html_nodes(".date")` and three of them appear to be "opinion dates". – IRTFM Mar 11 '16 at 15:33
  • Furthermore, if I ask the website for an American English translation I now see 10 ratingDate values both on the source and when I use your coding to search for ".ratingDate". – IRTFM Mar 11 '16 at 15:46
  • Thanks again for your help. But if I go to page [#60](http://www.tripadvisor.com/Hotel_Review-g562819-d237083-Reviews-or590-Hotel_Riu_Don_Miguel-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html#REVIEWS) of this hotel on the American English tripadvisor.com, I find the same problem. `url %>% read_html() %>% html_nodes(".ratingDate")` gives only eight nodes from the 10 reviews. I guess it has to be a problem with the CSS selector, but I can't find the solution and I'm going crazy looking for it. – Christian Gonzalez-Martel Mar 12 '16 at 11:24
  • Compare to the source rather than to the rendered page. – IRTFM Mar 12 '16 at 16:32
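
One way to make that comparison is to check the selector on markup that is known to contain both kinds of span. The sketch below is an illustration under assumptions: it uses the two spans quoted in the question instead of the live page, and the names `css_hits` and `xpath_hits` are made up for the example. Both the CSS selector and the XPath from the question return the two nodes here, which suggests the selector itself is not at fault and that the missing spans are absent from the HTML the server returns to R.

```r
library(rvest)

# The two spans quoted in the question, wrapped in a minimal document.
doc <- read_html('<div>
  <span class="ratingDate">Opinión escrita el 25 octubre 2006</span>
  <span title="6 marzo 2016" class="ratingDate relativeDate">Opinión escrita hace 4 días</span>
</div>', encoding = "UTF-8")

# The CSS class selector matches any element whose class list contains
# ratingDate, including elements that have additional classes.
css_hits <- html_nodes(doc, ".ratingDate")

# The XPath tried in the question is the equivalent class-token test.
xpath_hits <- html_nodes(
  doc,
  xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' ratingDate ')]"
)

length(css_hits)
length(xpath_hits)

# Only the relative-date span has a title attribute; the other yields NA.
html_attr(css_hits, "title")

# On the fetched page (html_hotel from the question), tabulating the class
# attribute of every span shows which classes the server actually sent:
# table(html_attr(html_nodes(html_hotel, "span"), "class"))
```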