I'm trying to extract the dates from TripAdvisor reviews.
I started with:
The dates come in two formats: an absolute format (day, month name, year), e.g. "Opinión escrita el 21 mayo 2010" ("Review written on 21 May 2010"), and a relative format, e.g. "Opinión escrita hace 4 días" ("Review written 4 days ago").
The 'normal' format has a single class, ratingDate:
<span class="ratingDate">Opinión escrita el 25 octubre 2006</span>
The 'relative' format has two classes, ratingDate and relativeDate, and carries the absolute date in its title attribute:
<span title="6 marzo 2016" class="ratingDate relativeDate">Opinión escrita hace 4 días</span>
I'm using R and the rvest package to scrape the dates.
library(rvest)  # also provides the %>% pipe
url_hotel <- "https://www.tripadvisor.es/Hotel_Review-g562819-d237083-Reviews-or150-Hotel_Riu_Don_Miguel-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html#REVIEWS"
html_hotel <- url_hotel %>% read_html()
Here is my problem. When I try to scrape the dates with this code
dates <- html_hotel %>% html_nodes(".ratingDate")
I get only the 'normal' dates, not the relative ones.
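For comparison, here is a minimal sketch that parses the two spans quoted above from a local string instead of the live page. It assumes the markup is exactly as quoted; the names snippet, spans and raw_dates are only illustrative. On this snippet the .ratingDate selector matches both spans, which suggests (but does not prove) that the HTML returned by read_html() differs from what the browser shows.

```r
library(rvest)

# The two spans quoted above, parsed from a local string (no network access)
snippet <- '
<div>
  <span class="ratingDate">Opinión escrita el 25 octubre 2006</span>
  <span title="6 marzo 2016" class="ratingDate relativeDate">Opinión escrita hace 4 días</span>
</div>'

spans <- read_html(snippet) %>% html_nodes(".ratingDate")

# Relative dates keep the absolute date in the title attribute;
# plain dates have no title, so html_attr() returns NA for them
titles <- spans %>% html_attr("title")
texts  <- spans %>% html_text()

# Fall back to the visible text (minus its prefix) when there is no title
raw_dates <- ifelse(is.na(titles), sub("^.*escrita el ", "", texts), titles)
raw_dates
```

If the live page contained the same markup, the same three steps applied to html_hotel should return one date string per review.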
Looking for a solution, I found this suggestion here, but
dates <- html_hotel %>% html_nodes(xpath="//*[contains(concat(' ', normalize-space(@class), ' '), ' ratingDate ')]")
didn't work. I keep getting the same results.
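One way to check what the downloaded document actually contains is to tabulate the class attribute of every matched node. The sketch below is only a diagnostic; count_date_classes is a hypothetical helper name, not part of rvest, and it is run here on a small local document so that it works offline.

```r
library(rvest)

# Hypothetical helper: count each distinct class attribute value
# among the nodes matched by the .ratingDate selector
count_date_classes <- function(doc) {
  classes <- doc %>% html_nodes(".ratingDate") %>% html_attr("class")
  table(classes)
}

# Offline demo document mimicking the two kinds of span
demo <- read_html(paste0(
  '<div><span class="ratingDate">a</span>',
  '<span title="6 marzo 2016" class="ratingDate relativeDate">b</span></div>'
))
class_counts <- count_date_classes(demo)
class_counts
```

Calling count_date_classes(html_hotel) on the fetched page would show whether any "ratingDate relativeDate" nodes are present in it at all; if none are, no selector or XPath can find them in that document.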
Here someone was trying to get the same data from TripAdvisor using Python. The answer given there didn't work for me either:
dates <- html_hotel %>% html_nodes(xpath='//div[@class="col2of2"]//span[@class="ratingDate relativeDate"/@title or @class="ratingDate"]')
Is there a way, with a better XPath or some other approach, to get the 'relative' dates?
Thanks in advance.