I am trying to read a Word document from a web URL using POI version 3.6. Non-working code:

String urlString = "http://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc";
InputStream inputStream = new URL(urlString).openStream();
HWPFDocument doc = new HWPFDocument(inputStream);
WordExtractor extractor = new WordExtractor(doc);
String text = extractor.getText();

The code above throws java.io.IOException: Unable to read entire header; 6 bytes read; expected 32 bytes

Attempt 2: the interesting part is that downloading the file (just pasting the URL in the browser address bar), and then executing similar code for reading the doc locally does work:

InputStream inputStream = new FileInputStream("C:\\Users\\me\\Downloads\\Master-DMP-Template (2).doc");
HWPFDocument doc = new HWPFDocument(inputStream);
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());

Attempt 3: and now the strangest part. I thought that the file needs to be downloaded first. So I downloaded it first using Java, and then executed the previous code for reading the doc locally. Fails like the first case!

final String url = "http://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc";
String localPath  = FileUtils.downloadFile("C:\\Users\\me\\Downloads", url);
InputStream inputStream = new FileInputStream(localPath);
HWPFDocument doc = new HWPFDocument(inputStream);
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());

public static String downloadFile(String targetDir, String sourceUrl) throws IOException {
    sourceUrl = StringUtils.removeEnd(sourceUrl, "/");
    String fileName = sourceUrl.substring(sourceUrl.lastIndexOf("/") + 1);
    String targetPath = targetDir + FileUtils.SEPARATOR + fileName;
    // try-with-resources closes the network stream even if the copy fails
    try (InputStream in = new URL(sourceUrl).openStream()) {
        Files.copy(in, Paths.get(targetPath), StandardCopyOption.REPLACE_EXISTING);
    }
    System.out.println("Downloaded " + sourceUrl + " to " + targetPath);
    return targetPath;
}

Any idea what is going on here?

An update: I created a separate project for trying with POI 4.1.0. Same code (of first attempt) results in org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)

I tried pasting the URL in the browser after hitting F12 and observing the Network tab. The message that appears there is: Resource interpreted as Document but transferred with MIME type application/msword: "https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc".

I am still stuck...

An update: as https://stackoverflow.com/users/3915431/axel-richter pointed out, there is a 301 redirect to https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc. However, now I am running into strange problems that are not related to Word. The following code fails:

public static void main(String[] args) {
    try {
        if (args.length > 0 && args[0].equals("disableCertValidation")) {
            SSLUtil.disableCertificateValidation(); // redirect is https
        }
        final String stringURL = "https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc";
        URL url = new URL(stringURL);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        int responseCode = con.getResponseCode();
        System.out.println("Response code: " + responseCode); //301 Moved Permanently
        InputStream in = con.getInputStream();
        HWPFDocument doc = new HWPFDocument(in);
        WordExtractor extractor = new WordExtractor(doc);
        String text = extractor.getText();
        System.out.println(text);
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

When running main without an argument, the line

int responseCode = con.getResponseCode();

fails with the following exception: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

When running the code with the disableCertValidation argument, the response code is 404 and I get the following exception:

java.io.FileNotFoundException: https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1890)
    at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1885)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1884)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1457)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
    at com.keywords.control.util.TestHTMLParser.main(TestHTMLParser.java:472)
Caused by: java.io.FileNotFoundException: https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1836)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
    at com.keywords.control.util.TestHTMLParser.main(TestHTMLParser.java:470)

Any ideas?

Jacobs2000

2 Answers

The initial HTTP request to your URL leads to a 301 Moved Permanently redirect. So we need to handle this and read the new location.

Complete example:

import java.io.InputStream;
import java.net.URL;
import java.net.HttpURLConnection;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class OpenHWPFFromURL {

 public static void main(String[] args) throws Exception {

  String stringURL = "http://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc";

  URL url = new URL(stringURL);
  HttpURLConnection con = (HttpURLConnection)url.openConnection();

  int responseCode = con.getResponseCode();
  System.out.println(responseCode); //301 Moved Permanently

  if (responseCode != HttpURLConnection.HTTP_OK) {
   if (responseCode == HttpURLConnection.HTTP_MOVED_TEMP
       || responseCode == HttpURLConnection.HTTP_MOVED_PERM
       || responseCode == HttpURLConnection.HTTP_SEE_OTHER) {
    url = new URL(con.getHeaderField("Location")); //get new location
    con = (HttpURLConnection)url.openConnection();
   }   
  }

  InputStream in = con.getInputStream();
  HWPFDocument doc = new HWPFDocument(in);
  WordExtractor extractor = new WordExtractor(doc);
  String text = extractor.getText();

  System.out.println(text);

 }
}

Note: Simply setting HttpURLConnection.setFollowRedirects to true (which is also the default) will not help if the redirect also changes the protocol (from HTTP to HTTPS, for example). That is exactly the case here, so we need to get the new location manually, as shown in my code.
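
The manual redirect handling described above can be generalized a little. The following is only a sketch, not part of the tested example: the class name RedirectFollower, its method names, and the maxHops limit are hypothetical, and it assumes the server sends a Location header that may be either absolute or relative. It follows several hops in a row, including protocol changes, and gives up after a fixed number of redirects.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper: follows redirects by hand, including HTTP -> HTTPS.
class RedirectFollower {

    /** True for the status codes that normally carry a Location header. */
    static boolean isRedirect(int code) {
        return code == HttpURLConnection.HTTP_MOVED_PERM   // 301
            || code == HttpURLConnection.HTTP_MOVED_TEMP   // 302
            || code == HttpURLConnection.HTTP_SEE_OTHER    // 303
            || code == 307 || code == 308;
    }

    /** Resolves a Location value against the URL that produced it (handles relative values). */
    static URL resolveLocation(URL base, String location) throws MalformedURLException {
        return new URL(base, location);
    }

    /** Opens the address, following at most maxHops redirects; the caller closes the stream. */
    static InputStream openFollowingRedirects(String address, int maxHops) throws IOException {
        URL url = new URL(address);
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setInstanceFollowRedirects(false); // every hop is handled in this loop
            int code = con.getResponseCode();
            if (!isRedirect(code)) {
                if (code != HttpURLConnection.HTTP_OK) {
                    con.disconnect();
                    throw new IOException("Unexpected response " + code + " from " + url);
                }
                return con.getInputStream();
            }
            String location = con.getHeaderField("Location");
            con.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + url);
            }
            url = resolveLocation(url, location);
        }
        throw new IOException("Too many redirects for " + address);
    }

    public static void main(String[] args) throws Exception {
        // Offline demonstration of the two pure helpers; example.com is a placeholder host.
        URL base = new URL("http://example.com/a/b.doc");
        System.out.println(isRedirect(301));
        System.out.println(resolveLocation(base, "https://example.com/a/b.doc"));
        System.out.println(resolveLocation(base, "/c/d.doc"));
    }
}
```

With such a helper, the stream returned by openFollowingRedirects(stringURL, 5) could be passed to new HWPFDocument(in) in place of con.getInputStream(). Failing on any non-200 final response also avoids handing an error page or an empty body to POI.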

Axel Richter
  • This looks like the right direction. Still, the code does not run successfully. There was a problem with HTTPS, as the redirect is to https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc. I disabled certificate validation as described in https://stackoverflow.com/questions/875467/java-client-certificates-over-https-ssl/876785#876785 but now I am getting a 404 response code in the line int responseCode = con.getResponseCode(); Pasting the HTTPS URL in the browser does work and returns 200 OK. Any idea how to make the code run? – Jacobs2000 May 13 '19 at 06:17
  • @Jacobs2000: My code is a complete example and is tested and works for me. For me it is able to get the document over `HTTPS` via the connection's input stream. What error do you get when running exactly my code? Are you behind a proxy server, maybe? – Axel Richter May 13 '19 at 06:42
  • I updated the question to provide more details about the new problems I am running into. I tried running the code both on my laptop and on a remote server so this cannot be related to local proxy settings. It may be related however to JVM settings - I am not sure where to look. – Jacobs2000 May 13 '19 at 18:52
  • @Jacobs2000: Sorry, I cannot help further. As I said, it works for me. As for your error, a Certificate Authority (CA) certificate is missing from `Java`'s `cacerts`. By default they are in `$JAVA_HOME/jre/lib/security/cacerts`. And by the way: simply calling `disableCertificateValidation`, as you tried, is **not recommended**, even if it worked. – Axel Richter May 14 '19 at 05:26
  • @Jacobs2000: The missing Certificate Authority (CA) certificate seems to be "Let's Encrypt Authority X3". But current `Java` versions will have that CA in their `cacerts`. So you seem to be running a very old, outdated `Java` version, aren't you? – Axel Richter May 14 '19 at 06:34
  • I am running Java 8 – Jacobs2000 May 15 '19 at 17:47
  • @Jacobs2000: Me too. But then something else must be wrong with the `cacerts` or with access to it in your environment. Such issues are nearly impossible to solve without direct access to your environment. Here it would lead to an endless discussion of trial and error: try this, try that, try something else, ... Try getting help from a specialist directly in your office. – Axel Richter May 16 '19 at 03:55
  • I upgraded my JDK to version 8-211 and the problem is gone. The code finally runs:) – Jacobs2000 May 18 '19 at 20:42
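
For the certificate issue discussed in the comments above, one way to see which CAs a given JVM trusts by default is to ask the standard JSSE trust manager for its accepted issuers. This is only a diagnostic sketch: the class name TrustStoreCheck and its method names are hypothetical, and the default search fragment "Root" is a placeholder to be replaced with part of the CA name being looked for.

```java
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical diagnostic: lists the CA certificates in the JVM's default trust store.
class TrustStoreCheck {

    /** Returns the subject names of all CAs the default trust managers accept. */
    static List<String> trustedIssuers() throws Exception {
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null); // null selects the JVM's default trust store
        List<String> names = new ArrayList<>();
        for (TrustManager tm : tmf.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                for (X509Certificate cert : ((X509TrustManager) tm).getAcceptedIssuers()) {
                    names.add(cert.getSubjectX500Principal().getName());
                }
            }
        }
        return names;
    }

    /** Case-insensitive substring search over the collected subject names. */
    static boolean containsIgnoreCase(List<String> names, String fragment) {
        String needle = fragment.toLowerCase();
        for (String name : names) {
            if (name.toLowerCase().contains(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        List<String> issuers = trustedIssuers();
        // Placeholder fragment; pass part of the CA name as the first argument instead.
        String fragment = args.length > 0 ? args[0] : "Root";
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("Trusted CA certificates: " + issuers.size());
        System.out.println("Contains \"" + fragment + "\": " + containsIgnoreCase(issuers, fragment));
    }
}
```

Running this on both an old and an updated JDK shows whether the two installations ship different sets of trusted CAs, which would be consistent with the JDK upgrade making the handshake error disappear.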

The code new URL(urlString).openStream() returns a plain InputStream rather than a FileInputStream like this one:

InputStream inputStream = new FileInputStream("C:\\Users\\me\\Downloads\\Master...")

Maybe the problem lies in this difference?