Scraping a webpage with jsoup gives an SSL error. Is this a site-specific issue? (jsoup works on other websites)
I'm trying to run a scrape. I run scrapes like this all the time, but this one failed. Normally I use jsoup to connect to a webpage and then grab what I want from the page. This one appears to be attempting an SSL handshake and failing.
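For context, my usual workflow is nothing more than this (a minimal sketch with placeholder names; it parses a literal HTML string instead of connecting, so it runs without a network connection — normally I'd call Jsoup.connect(url).get() instead):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Minimal version of the workflow: parse a page, then grab what I want.
// The HTML here is a stand-in for a fetched page.
public class JsoupBasics {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                    + "<body><span class='price'>9.99</span></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());                             // Demo
        System.out.println(doc.select("span.price").first().text()); // 9.99
    }
}
```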
I found a question with a similar issue, but I think that OP was hitting it on all jsoup scrapes, whereas mine is specific to this one website: https://www.strack.de/de/shop/?idm=1162&d=1&idmp=94&spb=MTQ7NzQ7MTI0OzEyMzY7 I have tried multiple pages on this site and all show the same issue, while every other site I have tried scrapes normally.
I tried installing the latest version of Java and restarting the PC, but the SSL connection still failed. I also tried opening the site in Firefox and downloading the certificate, but Firefox didn't have the same pathway as described in that answer:
"more info" > "security" > "show certificate" > "details" > "export.."
Since the scraper works fine on other websites, I think this issue has a separate cause, which is why I posted this as a new question rather than a comment on that one.
Here is what happened when I tried to download the cert: instead of "Show Certificate" there is a "View Certificate" option, and it has neither a "Details" tab nor an "Export" option, so I could not get a .cer file, and no export prompt ever appeared.
Am I doing something wrong that is making the handshake fail, or does this website have some mechanism that disallows scraping? I tried to scrape the pricing off this page: https://www.strack.de/de/shop/?idm=1162&d=1&idmp=94&spb=MTQ7NzQ7MTI0OzEyMzY7
I used jsoup to try to scrape the page and got an error; googling suggests it is one that people commonly hit when connecting to servers. Here is the error:
Exception in thread "main" javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:732)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:707)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:297)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:286)
	at scrapetestforstack.de.ScrapeTestForStackDe.main(ScrapeTestForStackDe.java:81)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
	at sun.security.validator.Validator.validate(Validator.java:260)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
	... 15 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
	... 21 more
C:\Users\LeonardDME\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 0 seconds)
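From what I can tell, this error means the JVM could not build a chain from the site's certificate to any root CA in its own truststore (cacerts), so it is a trust decision on my side rather than something jsoup does. This standalone check (a diagnostic sketch, not part of my scraper) shows how many roots the JVM trusts by default:

```java
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;
import java.security.KeyStore;

// Print how many root CAs the default JVM truststore contains. If the CA
// that signed the site's certificate (or a missing intermediate) cannot be
// chained to one of these, the PKIX "unable to find valid certification
// path" error above is exactly what you get.
public class ListTrustedRoots {
    public static void main(String[] args) throws Exception {
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(
                TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null); // null = use the JVM's default cacerts
        X509TrustManager tm = (X509TrustManager) tmf.getTrustManagers()[0];
        System.out.println("Trusted roots: " + tm.getAcceptedIssuers().length);
    }
}
```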
Here is the code I am running:
    //Phase 3: scrape the URL for more URLs
    Document doc = Jsoup.connect(URL).get();
    title = doc.title();
    // Strip characters that are awkward in a file name. Note that
    // replaceAll() takes a regex, so "|" has to be escaped as "\\|"
    // (an unescaped "|" matches only the empty string and removes nothing).
    title = title.replaceAll(" ", "")
                 .replaceAll("\\|", "")
                 .replaceAll(";", "");
    //Set up the file writing
    GimmeAName = "C:\\Users\\LeonardDME\\Documents\\NetBeansProjects\\ScrapeTestForStackDe\\Urls\\" + title + ".csv";
    File f = new File(GimmeAName);
    FileWriter fw = new FileWriter(f);
    PrintWriter out = new PrintWriter(fw);
    StuffToWrite = URLArray[counter];
    // collect the price spans from the document
    Elements spangrabbers = doc.getElementsByClass("art_orginal_preis142790");
    for (Element spangrab : spangrabbers)
    {
        holder2 = spangrab.text();
        SpanHolderArray[SpanHolderCounter] = holder2;
        SpanHolderCounter++;
    }
    // get all links in the page
    Elements links = doc.select("a[href]");
    for (Element link : links)
    {
        // get the value from the href attribute
        checker = link.attr("href");
        // skip absolute, javascript: and style links; keep the rest
        if (checker.contains("http") || checker.contains("javascript") || checker.contains("style"))
        {
            continue;
        }
        counter++;
        // || here, not && — with && the null check would never guard isEmpty()
        if (LinkContorter == null || LinkContorter.isEmpty())
        {
            //do nothing
        }
        else
        {
            System.out.println(LinkContorter);
            out.print(LinkContorter);
            out.print(",");
            out.print("\n");
            //Flush the output to the file
            out.flush();
        }
    }
    System.out.println(counter);
    //Close the PrintWriter
    out.close();
    //Close the FileWriter
    fw.close();
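One unrelated thing worth noting about the title cleanup: String.replaceAll() treats its first argument as a regular expression, so a bare "|" is an empty alternation that matches only the empty string, and replacing its matches with "" leaves the input unchanged. A standalone check (not tied to the site):

```java
// replaceAll() takes a regex, not a literal string, so "|" must be
// escaped to actually remove pipe characters.
public class PipeRegexDemo {
    public static void main(String[] args) {
        String title = "Foo | Bar";
        System.out.println(title.replaceAll("|", ""));   // Foo | Bar (unchanged)
        System.out.println(title.replaceAll("\\|", "")); // Foo  Bar
    }
}
```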
Is it possible that a few of you could try to scrape this site and see if you get the same result as me? I suspect there might be some safeguard against scraping, but I don't want to abandon the task unless I know that for sure. I also scraped this same website a few months ago, in February or March, without any issue.