0

In my Java Spring MVC web application, I am trying to use Apache Tika to extract text from a URL. I have added the following Maven dependency to my pom.xml file

<!-- Apache Tika -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.6</version>
</dependency>

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.6</version>
</dependency>

I am using the following code to get text content from a url:

URL url = new URL("http://www.basicsbehind.com/extract-text-webpage/");
InputStream input = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(input, teeHandler, metadata, parseContext);
System.out.println("title:\n" + metadata.get("title"));
System.out.println("links:\n" + linkHandler.getLinks());
System.out.println("text:\n" + textHandler.toString());
System.out.println("html:\n" + toHTMLHandler.toString());

The code works fine and I am able to get the required data. But if i use an HTTPS url instead, the following exception is thrown:

javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:961)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
    at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1546)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
    at java.net.URL.openStream(URL.java:1045)

This exception is not thrown for all HTTPS URLs, if I use the following code

URL url = new URL("https://www.smartwcm.com");

the exception is thrown. I am unable to figure out a solution for this.

Geo Thomas
  • 1,139
  • 3
  • 26
  • 59

0 Answers0