I want to use JSoup to extract some data from the reviews section on Amazon, and then store the data in a HashMap.

For a given Amazon product, I want to extract some of the reviewers' names and impact. A reviewer's impact is a number available on the reviewer's public profile page.

Extracting the reviewers' names works fine but I'm having a problem extracting the impact (see code and error message below).

Thanks for any help!

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.HashMap;


public class Question {

    public static void main(String[] args) throws IOException {
    
        HashMap<String, String> reviewers = new HashMap<String, String>();
    
        Document reviewPage = Jsoup.connect("https://www.amazon.co.uk/Charles-Dickens-Complete-Christmas-Collection/dp/B08FRBWTNX/ref=sr_1_1_sspa?crid=USM5FCL8WJZ4&keywords=charles+dickens&qid=1678359627&sprefix=charles+dickens%2Caps%2C127&sr=8-1-spons&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY&psc=1").get();
        Elements reviewPageElements = reviewPage.select(".review");
    
        for (Element reviewPageElement : reviewPageElements) {
        
            // reviewer's name 
            Element nameElement = reviewPageElement.getElementsByClass("a-profile-name").first();    
            String name = nameElement.text();
        
            // reviewer's profile page
            Element linkElement = reviewPageElement.getElementsByClass("a-profile").first();    
            String link = linkElement.attr("href"); 
            String url = "https://www.amazon.co.uk" + link;
         
            // reviewer's impact
            Document profilePage = Jsoup.connect(url).get();                    
            Elements impactElement = profilePage.getElementsByClass("impact-text");         
            String impact = impactElement.text();
        
            reviewers.put(name, impact);                                        
    
        }
    }
}

ERROR MESSAGE:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://www.amazon.co.uk/Charles-Dickens-Complete-Christmas-Collection/dp/B08FRBWTNX/ref=sr_1_1_sspa?crid=USM5FCL8WJZ4&keywords=charles+dickens&qid=1678359627&sprefix=charles+dickens%2Caps%2C127&sr=8-1-spons&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY&psc=1
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:459)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:475)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:434)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:181)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:170)
at Question.main(Question.java:48)
Robbie
  • What does `link` contain? – Robert Harvey Mar 09 '23 at 12:06
  • Should `getElementsByClass("impact-text")` be `getElementsByClass("impact-text").first()`? I ask because `getElementsByClass` returns multiple elements, and treating the result as a single element could cause issues. – ialexander Mar 09 '23 at 12:39
  • @RobertHarvey When I try to extract the URL of a profile (e.g. "https://www.amazon.co.uk/gp/profile/amzn1.account.AHCM4I7XQIND6LKBD3GLYIQBLEMA"), for some reason I only get a partial URL (e.g. "/gp/profile/amzn1.account.AHCM4I7XQIND6LKBD3GLYIQBLEMA"). I call the partial URL `link`, then manually prepend the missing part "https://www.amazon.co.uk" and call the complete profile URL `url`. – Robbie Mar 09 '23 at 13:40
  • The whole link works in a web browser. Sounds like amazon.co.uk is not happy with your "user agent" string or some other problem with your headers. Sometimes websites insist on actual browsers to discourage the very thing that you're trying to do. – Robert Harvey Mar 09 '23 at 13:45
  • @ialexander I still get the same error – Robbie Mar 09 '23 at 13:58
  • @Robbie Try changing the user agent: `Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3").get();` They may be using anti-scraping tech. See here for how to set it: https://stackoverflow.com/questions/6581655/jsoup-useragent-how-to-set-it-right – ialexander Mar 09 '23 at 14:25
  • Check your browser's requests for the impact request - maybe it is a `post` request that requires additional data, such as cookies or other fields and not just a simple `get`. – TDG Mar 10 '23 at 01:53
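
The two suggestions in the comments (resolving the partial profile link and sending a browser-like user agent) can be sketched with the JDK alone. This is only a minimal sketch under assumptions: it swaps Jsoup for `java.net.URI` and `java.net.http.HttpRequest` (Java 11+) so that it runs without extra dependencies, the class and method names `ProfileRequestSketch`, `resolve` and `browserLikeGet` are hypothetical, and the page URL is a shortened form of the one in the question. It builds the request but does not send it, and there is no guarantee that a different user agent will get past the 503.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class ProfileRequestSketch {

    // Hypothetical helper: resolves an href found on a page against that page's URL.
    // A root-relative href such as "/gp/profile/..." keeps the scheme and host of the page;
    // an href that is already absolute is returned unchanged.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Hypothetical helper: builds (but does not send) a GET request that
    // carries a browser-like User-Agent header.
    static HttpRequest browserLikeGet(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Assumption: shortened form of the product URL from the question.
        String pageUrl = "https://www.amazon.co.uk/Charles-Dickens-Complete-Christmas-Collection/dp/B08FRBWTNX/";
        String href = "/gp/profile/amzn1.account.AHCM4I7XQIND6LKBD3GLYIQBLEMA";

        String profileUrl = resolve(pageUrl, href);
        System.out.println(profileUrl);

        // User agent string taken from the comment thread.
        String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3";
        HttpRequest request = browserLikeGet(profileUrl, userAgent);
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

In Jsoup itself, `linkElement.absUrl("href")` performs the same resolution when the document was fetched with `Jsoup.connect(...).get()`, and `userAgent(...)` on the connection sets the header. Note also that the stack trace reports the product page URL, so the 503 appears to be raised by the first request, before any profile page is fetched; the user agent would therefore need to be set on that call as well.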

0 Answers