A while ago I built a web scraper for Amazon reviews that collected the reviews, ran text mining on them, and created phrase clouds. At some point, however, Amazon changed the page format, and now nothing works. I am using a URL that I had scraped successfully in the past. In this example, I am trying to extract the total review count, 1,284.
The item that I am looking to get is in the page source code below:
<div class="a-row">
<span data-hook="total-review-count"
class="a-size-medium totalReviewCount">1,284</span>
</div>
Aside from changing the format, Amazon also switched from http to https, which caused some problems of its own, but I think I overcame them using code from this link:
https://quantmacro.wordpress.com/2016/04/30/web-scraping-for-text-mining-in-r/
Here is all of the code:
library(XML)
library(RCurl)
fileURL1 <- 'https://www.amazon.com/Pet-Products-Self-Warming-Lounge-Sleeper/product-reviews/B00JHK370E/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=2'
# this is from the reference link above
pageData <- getURL(fileURL1, ssl.verifypeer = FALSE)
class(pageData)
[1] "character"
is.vector(pageData)
[1] TRUE
print(pageData)
doc <- htmlTreeParse(pageData, useInternalNodes = TRUE)
print(doc)
Now for the XPath. Given the HTML snippet above, I tried the following options:
reviewCounts <- xpathSApply(doc, "//div[@class='a-row']/span[contains(data-hook='total-review-count']", xmlValue)
This returns an error:
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
error evaluating xpath expression //div[@class='a-row']/span[contains(data-hook='total-review-count']
The next options all return List of 0:
reviewCounts <- xpathSApply(doc, "//span[@class='a-size-medium totalReviewCount']", xmlValue)
reviewCounts <- xpathSApply(doc, "//div[@class='a-row']/span[@class='a-size-medium totalReviewCount']", xmlValue)
reviewCounts <- xpathSApply(doc, "//div[@class='a-size-medium totalReviewCount']", xmlValue)
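For reference, the first expression fails to evaluate because attribute names in XPath need an @ prefix, contains() takes two arguments, and the parentheses are unbalanced. A minimal sketch of well-formed alternatives, run against the HTML snippet above parsed from a string rather than against the downloaded page (the names snippet, testDoc, byHook, byClass and byExact are illustrative, not part of the original scraper):

```r
library(XML)

# The HTML snippet from the page source, held as a string (no download involved)
snippet <- '<div class="a-row">
<span data-hook="total-review-count"
class="a-size-medium totalReviewCount">1,284</span>
</div>'

# Parse the string itself; asText = TRUE tells the parser this is not a file name
testDoc <- htmlParse(snippet, asText = TRUE)

# Exact match on the data-hook attribute (note the @ prefix)
byHook <- xpathSApply(testDoc, "//span[@data-hook='total-review-count']", xmlValue)

# contains() takes the attribute and the substring as two separate arguments
byClass <- xpathSApply(testDoc, "//span[contains(@class, 'totalReviewCount')]", xmlValue)

# The exact class match from the second attempt, for comparison
byExact <- xpathSApply(testDoc, "//span[@class='a-size-medium totalReviewCount']", xmlValue)

print(byHook)
print(byClass)
print(byExact)
```

If these expressions find the span in the snippet but return nothing on doc, that would suggest the downloaded page does not contain the element, rather than a problem with the XPath itself.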
I did get it to work once, but when I re-ran the code, it failed again. Is https preventing doc from being a valid parsed document, or is my XPath wrong?
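One rough way to tell the two cases apart would be to check whether the downloaded text contains the target attribute value at all, before any parsing. This is only a sketch: hasReviewCount is a hypothetical helper, and the two strings below are stand-ins for pageData so that the example runs without a network connection.

```r
# Hypothetical helper: TRUE if the raw HTML text mentions the target data-hook value
hasReviewCount <- function(html) {
  # fixed = TRUE treats the pattern as a literal string, not a regular expression
  grepl("total-review-count", html, fixed = TRUE)
}

# Stand-in inputs; in the scraper this would be hasReviewCount(pageData)
withElement <- '<span data-hook="total-review-count">1,284</span>'
withoutElement <- '<html><body>placeholder page</body></html>'

print(hasReviewCount(withElement))
print(hasReviewCount(withoutElement))
```

If the check returns FALSE for pageData, the server returned different content from what the browser shows, and no XPath would find the span in doc.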