How can I specify in XPATH the following on a web page in R

Question

A while ago I built a web scraper for Amazon reviews that collected the reviews, ran text mining on them, and created phrase clouds. At some point Amazon changed the whole page format, and now nothing works. I am using a URL that I had scraped successfully in the past. In this example, I am trying to extract the value 1,284.

The item that I am looking to get is in the page source code below:

    <div class="a-row">
    <span data-hook="total-review-count" 
     class="a-size-medium totalReviewCount">1,284</span>
    </div>

Besides changing the format, they also switched from HTTP to HTTPS, which caused some problems, but I think I overcame them using code from this link:

https://quantmacro.wordpress.com/2016/04/30/web-scraping-for-text-mining-in-r/

Here is all of the code:

    library(XML)
    library(RCurl)

    fileURL1 <- 'https://www.amazon.com/Pet-Products-Self-Warming-Lounge-Sleeper/product-reviews/B00JHK370E/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=2'

    # this is from the reference link above
    pageData <- getURL(fileURL1, ssl.verifypeer = FALSE)

    class(pageData)
    [1] "character"

    is.vector(pageData)
    [1] TRUE

    print(pageData)

    doc <- htmlTreeParse(pageData, useInternal=TRUE)
    print(doc)

Now for the XPath. Given the HTML snippet above, I tried the following options:

    reviewCounts <- xpathSApply(doc, "//div[@class='a-row']/span[contains(data-hook='total-review-count']", xmlValue)

Returns error:

    Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
      error evaluating xpath expression //div[@class='a-row']/span[contains(data-hook='total-review-count']

The next options all return List of 0:

    reviewCounts <- xpathSApply(doc, "//span[@class='a-size-medium totalReviewCount']", xmlValue)

    reviewCounts <- xpathSApply(doc, "//div[@class='a-row']/span[@class='a-size-medium totalReviewCount']", xmlValue)

    reviewCounts <- xpathSApply(doc, "//div[@class='a-size-medium totalReviewCount']", xmlValue)

I did get it to work once, but when I re-ran the code it failed. Is HTTPS preventing the page from being retrieved correctly, or is my XPath wrong?
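To separate the two possibilities, the XPath syntax can be checked offline against the HTML snippet shown earlier, with no request to Amazon. The following is only a sketch: it assumes the snippet is representative of the real page, and the variable names (`snippet`, `by_hook`, `by_class`, `review_count`) are made up for illustration. It shows two forms that are valid XPath 1.0: an attribute test written with the `@` prefix, and `contains()` called with two comma-separated arguments.

```r
library(XML)

# HTML fragment copied from the question; it stands in for a correctly fetched page
snippet <- '<div class="a-row">
<span data-hook="total-review-count"
 class="a-size-medium totalReviewCount">1,284</span>
</div>'

# Parse the string itself rather than treating it as a file name or URL
snippet_doc <- htmlParse(snippet, asText = TRUE)

# Attribute names need the @ prefix inside a predicate
by_hook <- xpathSApply(snippet_doc,
                       "//span[@data-hook='total-review-count']",
                       xmlValue)

# contains() takes (haystack, needle), which tolerates extra class names
by_class <- xpathSApply(snippet_doc,
                        "//span[contains(@class, 'totalReviewCount')]",
                        xmlValue)

# Drop the thousands separator before converting to a number
review_count <- as.integer(gsub(",", "", by_hook, fixed = TRUE))
```

If these expressions match on the snippet but return an empty list on `doc`, the expression is probably not the problem and the fetched content is the more likely culprit.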

Here is a quick thought to help you troubleshoot. Try `grepl("1,284", pageData, fixed=TRUE)`. Did you find anything? If not, take a closer look at what you got from `getURL`. Maybe it's not what you expect. You can also see how the output renders in a browser. — Jota, Jan 11 '18 at 04:30
Thanks, it came up as FALSE, so it looks like I am not getting the page data I expected. HTTPS may still be an issue. — Bryan Butler, Jan 11 '18 at 14:23
Take a closer look at `doc`. Literally look at everything that got returned. That should help you see what's going on. Some strong hints: https://stackoverflow.com/a/29647127 & https://stackoverflow.com/a/25937108. — Jota, Jan 12 '18 at 03:48
I took a look. The issue is that Amazon does not want people to scrape reviews anymore. I used to do it all the time. They return a page of junk if they think you are scraping. — Bryan Butler, Jan 27 '21 at 22:38
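The check suggested in the comments can be wrapped in a small guard that runs before any XPath query, so that a substituted page is reported explicitly instead of showing up as an empty result. This is a minimal sketch: `page_has_marker` is a hypothetical helper, not part of RCurl or XML, and both sample strings are made-up stand-ins rather than real Amazon responses.

```r
# Hypothetical helper: TRUE only when the fetched text is a single string
# that contains the markup the XPath query is going to look for
page_has_marker <- function(page_text,
                            marker = 'data-hook="total-review-count"') {
  is.character(page_text) &&
    length(page_text) == 1 &&
    grepl(marker, page_text, fixed = TRUE)
}

# Made-up stand-ins for a normal page and for a substituted page
good_page <- '<span data-hook="total-review-count">1,284</span>'
junk_page <- '<html><body>placeholder page without review markup</body></html>'

good_ok <- page_has_marker(good_page)
junk_ok <- page_has_marker(junk_page)
```

In the question's code, the same guard could be applied to `pageData` right after `getURL`, skipping the parse and XPath steps when it returns `FALSE`.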
