How to use BufferedInputStream to read a large Microsoft Word document in Java 7?

Question

Is it possible to use the solution to this question for Microsoft Word files that are large?

In other words, will the following code work if I replace "file.txt" below with "file.doc" ?

final InputStream in = new BufferedInputStream(new FileInputStream("file.txt"));
final long start = System.currentTimeMillis();
int cnt = 0;
final byte[] buf = new byte[1000];
while (in.read(buf) != -1) cnt++;
in.close();
System.out.println("Elapsed " + (System.currentTimeMillis() - start) + " ms");

What do you want to do? Do you want to process the file somehow? This will read the .doc file, which is a binary format. If you just want to copy or send it, then that's fine. — jakub.petr, Apr 06 '15 at 20:22
I am trying to use Apache Tika to extract text from an MS Word document. This works perfectly if the Word document is not large, but I get Java heap space errors if the document is 100 MB or larger. So I'm trying to figure out a way to break the large MS Word document into chunks that Apache Tika can parse. — user1068636, Apr 06 '15 at 21:17
You're assuming that's possible. It isn't. The code you have written will work but it won't accomplish your objective. — user207421, Apr 06 '15 at 21:54
`InputStream` doesn’t care whether the input is a Word file or a plain text file. And the general answer is that `BufferedInputStream` is useless in 99% of all cases. It helps when you are using `read()` to read the input byte by byte, which you simply shouldn’t do when you care about performance. In your example code the reading is already buffered, though the buffer size `1000` is smaller than the default buffer size `8192` of `BufferedInputStream`. Once you raise your buffer size to `8192` or higher, `BufferedInputStream` will pass your read request directly to the `FileInputStream`… — Holger, Apr 08 '15 at 15:43
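
To illustrate the point about buffer sizes, here is a minimal sketch that reads straight from a `FileInputStream` into an 8192-byte array, with no `BufferedInputStream` in between. The class name `ChunkedRead` and the method `countBytes` are made up for this example, and `file.doc` is just the placeholder name from the question. It only counts raw bytes; it does not interpret the Word format in any way.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not part of any library.
class ChunkedRead {

    // Reads the file in chunks of up to bufferSize bytes and returns the total byte count.
    static long countBytes(String path, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        // try-with-resources (Java 7+) closes the stream even if read() throws.
        try (InputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n; // read() may return fewer bytes than buf.length
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // "file.doc" is the placeholder file name used in the question.
        String path = args.length > 0 ? args[0] : "file.doc";
        long start = System.currentTimeMillis();
        long total = countBytes(path, 8192);
        System.out.println(total + " bytes in " + (System.currentTimeMillis() - start) + " ms");
    }
}
```

Memory use stays at one buffer regardless of file size, but the chunks are arbitrary slices of a binary file, so they are not something a document parser could consume one at a time.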

score 0 · Answer 1 · answered Apr 07 '15 at 10:18

Try converting your .doc (binary) to .docx (XML) first, ideally with a command-line utility from MS.

Then the parsing library (I am not familiar with Apache Tika) could use a streaming XML parser (SAX), which is great for processing large files, or you could even parse it yourself (the XML representation is readable).
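
A rough sketch of the "parse it yourself" route, using only the JDK. It relies on the fact that a .docx file is a ZIP archive whose main body is stored in `word/document.xml`, with text held in `w:t` elements and paragraphs in `w:p` elements. The class name `DocxTextStreamer` is invented for this example, and the namespace constant assumes the usual (transitional) WordprocessingML namespace; a document saved in strict mode would need a different URI. Headers, footnotes, tables layout and other parts are ignored.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Apache Tika or any other library.
class DocxTextStreamer {

    // Assumption: the document uses the transitional WordprocessingML namespace.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Streams the body text of a .docx to out without building an in-memory tree.
    static void extractText(String docxPath, final Writer out) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found; is this a .docx file?");
            }
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so we can match on namespace URI + local name
            try (InputStream xml = zip.getInputStream(entry)) {
                factory.newSAXParser().parse(xml, new DefaultHandler() {
                    private boolean inText; // true while inside a w:t element

                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        if (W_NS.equals(uri) && "t".equals(local)) inText = true;
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) throws SAXException {
                        if (!W_NS.equals(uri)) return;
                        if ("t".equals(local)) {
                            inText = false;
                        } else if ("p".equals(local)) {
                            write("\n".toCharArray(), 0, 1); // one line per paragraph
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) throws SAXException {
                        if (inText) write(ch, start, length);
                    }

                    private void write(char[] ch, int start, int length) throws SAXException {
                        try {
                            out.write(ch, start, length);
                        } catch (IOException e) {
                            throw new SAXException(e);
                        }
                    }
                });
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Writer out = new OutputStreamWriter(System.out, "UTF-8");
        extractText(args[0], out); // args[0]: path to a .docx file
        out.flush();
    }
}
```

Because SAX delivers the text as events, the handler holds only the current chunk of characters, so memory use should not grow with the size of the document.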

score 0 · Answer 2 · answered Apr 08 '15 at 16:09

Have you tried

Path filePath = Paths.get("Your File Path", "Your File Name");
byte[] bytes = Files.readAllBytes(filePath);

For reference http://www.java2s.com/Tutorials/Java/java.nio.file/Files/Java_Files_readAllBytes_Path_path_.htm

Shar1er80
