I built a web scraper for Amazon reviews a while ago that collected reviews, ran text mining on them, and created phrase clouds. At some point, however, Amazon changed the page format and now nothing works. I am using a URL that I scraped successfully in the past. In this example, I am trying to extract the string 1,284 (the total review count).

The item that I am looking to get is in the page source code below:

    <div class="a-row">
    <span data-hook="total-review-count" 
     class="a-size-medium totalReviewCount">1,284</span>
    </div>

Aside from changing the format, they also switched from http to https, which caused some problems as well, but I think I overcame them using code from this link:

https://quantmacro.wordpress.com/2016/04/30/web-scraping-for-text-mining-in-r/

Here is all of the code:

    library(XML)
    library(RCurl)

    fileURL1 <- 'https://www.amazon.com/Pet-Products-Self-Warming-Lounge-Sleeper/product-reviews/B00JHK370E/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=2'

    # this is from the reference link above
    pageData <- getURL(fileURL1, ssl.verifypeer = FALSE)

    class(pageData)
    # [1] "character"

    is.vector(pageData)
    # [1] TRUE

    print(pageData)

    doc <- htmlTreeParse(pageData, useInternal=TRUE)
    print(doc)

Now for the XPath. Given the HTML snippet above, I tried the following options:

reviewCounts <- xpathSApply(doc, "//div[@class='a-row']/span[contains(data-hook='total-review-count']", xmlValue)

Returns error:

Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression //div[@class='a-row']/span[contains(data-hook='total-review-count']
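
For reference, the expression above is rejected before it is ever matched against the page: `contains()` takes two arguments, the attribute needs an `@` prefix, and the closing parenthesis is missing. A minimal sketch of two well-formed alternatives follows, run against an inline copy of the snippet from the question rather than the live page. The variable names (`snippet`, `doc_snippet`, `count_exact`, `count_contains`, `count_numeric`) are made up for illustration.

```r
library(XML)

# Inline copy of the HTML snippet from the question, so no network access is needed.
snippet <- '<div class="a-row">
<span data-hook="total-review-count"
 class="a-size-medium totalReviewCount">1,284</span>
</div>'

doc_snippet <- htmlParse(snippet, asText = TRUE)

# Exact match on the attribute value; note the @ prefix on the attribute name.
count_exact <- xpathSApply(
  doc_snippet,
  "//span[@data-hook='total-review-count']",
  xmlValue
)

# Substring match: contains() takes the attribute and the substring as two arguments.
count_contains <- xpathSApply(
  doc_snippet,
  "//div[@class='a-row']/span[contains(@data-hook, 'total-review-count')]",
  xmlValue
)

# Strip the thousands separator before converting the text to a number.
count_numeric <- as.numeric(gsub(",", "", count_exact, fixed = TRUE))
```

This only shows that the expressions are valid for the snippet; it says nothing about whether the page returned by `getURL` actually contains that element.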

The next options all return List of 0:

reviewCounts <- xpathSApply(doc, "//span[@class='a-size-medium totalReviewCount']", xmlValue)

reviewCounts <- xpathSApply(doc, "//div[@class='a-row']/span[@class='a-size-medium totalReviewCount']", xmlValue)

reviewCounts <- xpathSApply(doc, "//div[@class='a-size-medium totalReviewCount']", xmlValue)

I did get it to work once, but when I re-ran the code, it failed again. Is https preventing `doc` from holding the real page content, or is my XPath wrong?
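
One way to separate those two possibilities is to run the three "List of 0" expressions against the snippet itself instead of the downloaded page. The sketch below assumes the snippet shown earlier is representative of the real markup; the variable names are hypothetical.

```r
library(XML)

# Inline copy of the HTML snippet from the question.
snippet <- '<div class="a-row">
<span data-hook="total-review-count"
 class="a-size-medium totalReviewCount">1,284</span>
</div>'

doc_snippet <- htmlParse(snippet, asText = TRUE)

# First attempt: span selected by its full class attribute.
span_by_class <- xpathSApply(
  doc_snippet,
  "//span[@class='a-size-medium totalReviewCount']",
  xmlValue
)

# Second attempt: the same span, reached through its parent div.
span_in_row <- xpathSApply(
  doc_snippet,
  "//div[@class='a-row']/span[@class='a-size-medium totalReviewCount']",
  xmlValue
)

# Third attempt: looks for a div carrying the span's class, which the snippet does not have.
div_by_class <- xpathSApply(
  doc_snippet,
  "//div[@class='a-size-medium totalReviewCount']",
  xmlValue
)

print(span_by_class)
print(span_in_row)
print(length(div_by_class))
```

If the first two expressions find the value in the snippet but return nothing for `doc`, that points at the content of `doc` rather than at the XPath. The third expression cannot match either way, because the class sits on a `span`, not a `div`.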

  • Here is a quick thought to help you troubleshoot. Try `grepl("1,284", pageData, fixed=TRUE)`. Did you find anything? If not, take a closer look at what you got from `getURL`. Maybe it's not what you expect. You can also see how the output renders in a browser. – Jota Jan 11 '18 at 04:30
  • Thanks, came up as FALSE, so it looks like I am not getting my page data as expected. Looks like the HTTPS may still be an issue. – Bryan Butler Jan 11 '18 at 14:23
  • Take a closer look at `doc`. Literally look at everything that got returned. That should help you see what's going on. Some strong hints: https://stackoverflow.com/a/29647127 & https://stackoverflow.com/a/25937108. – Jota Jan 12 '18 at 03:48
  • I took a look. It turns out that Amazon does not want people to scrape reviews anymore. I used to do it all the time. They return a page of junk if they think you are scraping. – Bryan Butler Jan 27 '21 at 22:38
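
The troubleshooting steps suggested in the comments above can be sketched as a small helper. `check_page` is a hypothetical function, not part of `XML` or `RCurl`, and both test pages are invented stand-ins: the second one is only a placeholder for "a page without the expected element", not a copy of anything Amazon actually returns.

```r
library(XML)

# Hypothetical helper: reports whether the raw HTML contains the expected text,
# how many nodes the XPath matches, and what the page title is.
check_page <- function(html, needle, xpath) {
  doc <- htmlParse(html, asText = TRUE)
  list(
    has_text = grepl(needle, html, fixed = TRUE),
    n_nodes  = length(getNodeSet(doc, xpath)),
    title    = xpathSApply(doc, "//title", xmlValue)
  )
}

# Invented page that contains the element from the question.
expected_page <- '<html><head><title>Customer reviews</title></head><body>
<div class="a-row"><span data-hook="total-review-count"
 class="a-size-medium totalReviewCount">1,284</span></div>
</body></html>'

# Invented placeholder for a response that lacks the review markup.
other_page <- '<html><head><title>Placeholder</title></head><body>
<p>No review markup here.</p></body></html>'

xp <- "//span[@data-hook='total-review-count']"

good <- check_page(expected_page, "1,284", xp)
bad  <- check_page(other_page, "1,284", xp)

str(good)
str(bad)
```

Calling `check_page(pageData, "1,284", xp)` on the real download would show at a glance whether the text, the node, and a sensible title are present before any time is spent on the XPath.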
