The title may be a little confusing. The simplest method is to decide by file extension, like this:

// "is" is the InputStream to read from
if (filePath.endsWith("doc")) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if(filePath.endsWith("docx")) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}

This works in most cases. But I have found that some files with the extension doc are really docx files: if you open one with WinRAR, you will find XML files inside. As is well known, a docx file is a zip archive consisting of XML files. I believe this problem is not rare, but I have not found any information about it. Obviously, the file extension is not a reliable way to decide whether to read a file as doc or docx.
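To illustrate the point about extensions, here is a minimal sketch that looks at the first bytes of the stream instead of the file name. It uses only the JDK; the class name `WordFormatSniffer` and the `Kind` enum are hypothetical names for this example. It assumes the stream supports mark/reset (for example a `BufferedInputStream`). Note that a ZIP signature only says "zip container" and an OLE2 signature only says "OLE2 compound file", so this narrows the choice between doc and docx but does not prove the content is a Word document.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class WordFormatSniffer {

    /** Hypothetical result type for this sketch. */
    public enum Kind { OLE2, ZIP, UNKNOWN }

    // First 8 bytes of an OLE2 compound file (binary .doc, but also .xls, .ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // First 4 bytes of a ZIP local file header ("PK" 0x03 0x04), as used by .docx.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Peeks at the first bytes of a mark-supporting stream, then rewinds it. */
    public static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] head = new byte[8];
        int n = 0;
        try {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        } finally {
            in.reset(); // the caller can still parse the stream from the start
        }
        if (startsWith(head, n, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, n, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int len, int[] magic) {
        if (len < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Fake content that merely starts with the ZIP signature
        byte[] fakeDocx = {0x50, 0x4B, 0x03, 0x04, 0, 0, 0, 0, 0};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(fakeDocx));
        System.out.println(sniff(in));
    }
}
```

Because the stream is rewound after the check, the same stream can be handed to whichever parser matches the detected kind.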

In my case, I have to read a lot of files, and I even have to read doc or docx files inside compressed archives (zip, 7z, or even rar). Hence, I have to read the content from an InputStream rather than from a File or something else. So the ZipInputStream approach from "how to know whether a file is .docx or .doc format from Apache POI" is not suitable for my case.
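For the archive case, a rough sketch of the stream-only workflow could look like the following. It covers zip archives only, because the JDK has no reader for 7z or rar; the class `ArchiveEntrySniffer` and its methods are hypothetical names for this example. It assumes each entry is small enough to buffer in memory, which is a design choice of the sketch and not a requirement of the format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveEntrySniffer {

    /** Maps each file entry of a zip stream to "OLE2", "ZIP" or "UNKNOWN". */
    public static Map<String, String> classify(InputStream archive) throws IOException {
        Map<String, String> kinds = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Buffer the entry so it can be inspected and then parsed from the start
                byte[] bytes = readCurrentEntry(zin);
                kinds.put(entry.getName(), kindOf(bytes));
                // A parser would be handed new ByteArrayInputStream(bytes) here.
            }
        }
        return kinds;
    }

    // Reads the current entry only: read() returns -1 at the end of each entry.
    private static byte[] readCurrentEntry(ZipInputStream zin) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = zin.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    static String kindOf(byte[] b) {
        if (hasPrefix(b, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return "OLE2"; // candidate for the binary .doc parser
        }
        if (hasPrefix(b, 0x50, 0x4B, 0x03, 0x04)) {
            return "ZIP"; // candidate for the .docx parser
        }
        return "UNKNOWN";
    }

    private static boolean hasPrefix(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny archive in memory with one entry that starts with the ZIP signature
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(raw)) {
            zout.putNextEntry(new ZipEntry("renamed.doc"));
            zout.write(new byte[] {'P', 'K', 3, 4, 0, 0});
            zout.closeEntry();
        }
        System.out.println(classify(new ByteArrayInputStream(raw.toByteArray())));
    }
}
```

Buffering each entry costs memory, but it avoids reading the outer archive twice and never needs a File on disk.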

What is the best way to tell whether a file is doc or docx? I want a solution that reads the content of a file which may be either doc or docx, not one that merely tells me which of the two it is. Apparently, ZipInputStream is not a good method for my case, and I believe it is not an appropriate method for others either. Why should I have to determine whether the file is doc or docx by catching an exception?

eebbesen
neal
  • https://stackoverflow.com/questions/41711627/how-to-know-whether-a-file-is-docx-or-doc-format-from-apache-poi –  Nov 25 '17 at 05:52
  • @ClayFerguson please read my question carefully; I have already seen this. I want an appropriate way to read a doc or a docx file. – neal Nov 25 '17 at 05:55
  • Possible duplicate of [how to know whether a file is .docx or .doc format from Apache POI](https://stackoverflow.com/questions/41711627/how-to-know-whether-a-file-is-docx-or-doc-format-from-apache-poi) – STaefi Nov 25 '17 at 05:58
  • I also can't tell how your question is not answered by @ClayFerguson's link. The referenced solution gives a simple way to test if the file is a Zip file...thereby distinguishing between doc and docx. – lockcmpxchg8b Nov 25 '17 at 06:00
  • @STaefi Please read my question carefully !!!!!!!!!!!!!!!!!!!!!!!!!!!! – neal Nov 25 '17 at 06:00
  • I think you should perhaps take a pass at revising your question, if three of us cannot distinguish what you're asking from existing solutions – lockcmpxchg8b Nov 25 '17 at 06:01
  • @lockcmpxchg8b using that method will cause problems when reading a doc file; you have not even tested it – neal Nov 25 '17 at 06:02
  • @neal, so once you detect it's a zip file, you are still going to try to treat it as a 'doc' file? Yes that will "bring problem". –  Nov 25 '17 at 06:04
  • I don't quite follow. Are you unable to wrap your InputStream with a BufferedInputStream and mark/reset it after testing whether it's a ZIP, so that you can parse correctly as .doc or .docx? – lockcmpxchg8b Nov 25 '17 at 06:05
  • @ClayFerguson when you are reading a normal doc file, it will be a problem. – neal Nov 25 '17 at 06:06
  • @lockcmpxchg8b The cost will be too much. – neal Nov 25 '17 at 06:08
  • If you have constraints, you should list them in your question, rather than making us drag them out one at a time as we propose solutions. – lockcmpxchg8b Nov 25 '17 at 06:11
  • @neal, the whole point is that you will get a failure if it is a 'doc' file, and that is correct, and good code still. If it fails to run as a zip then that tells you it's not a zip. This is correct. It is not a problem. What you are thinking is the 'problem' is actually just the 'test' to see if it's a zip or not. If it fails to work as a zip, then you go ahead and assume it's a 'doc' –  Nov 25 '17 at 06:13
  • Maybe subclass InputStream to split into two output channels, so you can do trial parses for both .doc and .docx in parallel threads? – lockcmpxchg8b Nov 25 '17 at 06:14
  • @lockcmpxchg8b I have stressed it in my question. I want a solution to read a file which may be doc or docx, not just one that tells whether the file is doc or docx. – neal Nov 25 '17 at 06:14
  • But you've given the code to do the parsing already, we're just proposing a different test in your conditional. Are you just asking whether there is a library to do this already? – lockcmpxchg8b Nov 25 '17 at 06:16
  • @neal, ok the answer is no then. It doesn't exist. You have to do what we told you already. If performance is a concern look here: https://stackoverflow.com/questions/33934178/how-to-identify-a-zip-file-in-java and google "how to detect if a stream is a zip file" but you are going to have to keep the two blocks of code you already have and detect the file type and then do the right thing. There may not be a way you can do it and write "zero code", which seems to be your goal. –  Nov 25 '17 at 06:19

2 Answers

4

With Apache POI 3.17 (the current stable version at the time of writing) you can use FileMagic. Internally it still has to look into the file, of course: it inspects the first bytes of the stream.

Example:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

    // The stream must support mark/reset (e.g. a BufferedInputStream),
    // because FileMagic.valueOf peeks at the first bytes and then rewinds.
    static String read(InputStream is) throws Exception {

        // Detect the format once and reuse the result
        FileMagic magic = FileMagic.valueOf(is);
        System.out.println(magic);

        String text = "";

        if (magic == FileMagic.OLE2) {
            // binary Word format
            WordExtractor ex = new WordExtractor(is);
            text = ex.getText();
            ex.close();
        } else if (magic == FileMagic.OOXML) {
            // Office Open XML format
            XWPFDocument doc = new XWPFDocument(is);
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            text = extractor.getText();
            extractor.close();
        }

        return text;
    }

    public static void main(String[] args) throws Exception {

        InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); // really a binary OLE2 Word file
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); // an OOXML Word file named *.doc
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); // really an OOXML Word file
        System.out.println(read(is));
        is.close();
    }
}
Axel Richter
  • thanks very much!! There is finally an awesome solution. I will try to read the implementation of this. – neal Nov 25 '17 at 08:48
0
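import java.io.File;
import java.io.IOException;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// try-with-resources releases the file handle after the check
try (ZipFile zip = new ZipFile(new File("/Users/giang/Documents/a.doc"))) {
    System.out.println("this file is .docx");
} catch (ZipException e) {
    // not a valid zip container, so it cannot be a .docx
    System.out.println("this file is not .docx");
} catch (IOException e) {
    // the file could not be read at all
    e.printStackTrace();
}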
yelliver