5

I am trying to read a file in Java. Here is the code:

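import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public void readFile(String fileName) {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException ex) {
        ex.printStackTrace(); // report the failure instead of silently swallowing it
    }
}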

It works fine for a .txt file. However, for a .docx file it prints weird characters. How can I read a .docx file in Java?

JasonPlutext
Addict
  • [Apache POI](http://poi.apache.org/) seems to be the most common library used to read Microsoft file formats. – jahroy May 22 '13 at 03:24
  • That is certainly true for Excel (.xls and .xlsx) – JasonPlutext Nov 28 '14 at 19:07
  • Possible duplicate of [How read Doc or Docx file in java?](http://stackoverflow.com/questions/7102511/how-read-doc-or-docx-file-in-java) – mlt Oct 06 '15 at 22:46
  • `.docx` files are not plain text files like `.txt` files; they are encoded differently. You need an API to read them, as suggested by @jahroy above. – Tech Expert Wizard Nov 11 '20 at 12:58

4 Answers

13
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class DocxReader {
    public void readDocxFile() {
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream("C:/NetBeans Output/documentx.docx")) {
            XWPFDocument document = new XWPFDocument(fis);

            List<XWPFParagraph> paragraphs = document.getParagraphs();

            // Print the text of each body paragraph in document order
            for (XWPFParagraph para : paragraphs) {
                System.out.println(para.getText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Raju Ahmed
7

Internally, .docx files are organized as zipped XML files, whereas .doc is a binary file format, so you cannot read either one as plain text. Have a look at docx4j or Apache POI.

If you are trying to create or manipulate a .docx file, try docx4j (its source is available online), or go for Apache POI.
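
To make the zipped-XML point concrete, here is a minimal sketch that reads a .docx using only the JDK, with no Word library. It assumes the main body lives in the `word/document.xml` entry and that the file uses the transitional WordprocessingML namespace; the class name `DocxZipPeek` and the method `readParagraphs` are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxZipPeek {
    // Namespace of WordprocessingML elements such as <w:p> and <w:t>
    // (assumption: transitional OOXML, which is what Word normally writes)
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the concatenated run text of each <w:p> in word/document.xml. */
    public static List<String> readParagraphs(String docxPath) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("No word/document.xml in " + docxPath);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            // Reject DOCTYPE declarations so untrusted files cannot pull in external entities
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document xml = factory.newDocumentBuilder().parse(in);
                List<String> result = new ArrayList<>();
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element paragraph = (Element) paragraphs.item(i);
                    NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
                    StringBuilder text = new StringBuilder();
                    for (int j = 0; j < texts.getLength(); j++) {
                        text.append(texts.item(j).getTextContent());
                    }
                    result.add(text.toString());
                }
                return result;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxZipPeek path/to/file.docx
        for (String paragraph : readParagraphs(args[0])) {
            System.out.println(paragraph);
        }
    }
}
```

This sketch ignores tabs, line breaks, headers, footers and nested text boxes, so for anything beyond a quick look a library such as Apache POI or docx4j is still the better choice.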

morido
2

You cannot read a .docx or .doc file directly as text. You need an API to read Word files. Use Apache POI (http://poi.apache.org/). If you have any doubts, please refer to this Stack Overflow thread: [How read Doc or Docx file in java?](http://stackoverflow.com/questions/7102511/how-read-doc-or-docx-file-in-java)
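
Because .docx and .doc need different readers (XWPF versus HWPF in Apache POI), it can help to check which container a file really uses before picking one. The following is a rough sketch, assuming the usual leading bytes: a ZIP local file header for .docx and the OLE2 compound file signature for .doc. The class name `WordFormatSniffer` and its return values are invented for this example, and a ZIP match only means "possibly .docx", since other formats are ZIP-based too.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WordFormatSniffer {
    // Leading bytes of a ZIP local file header ("PK\003\004"), the container used by .docx
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // Leading bytes of an OLE2 compound file, the container used by legacy .doc
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns "ZIP", "OLE2" or "UNKNOWN" based on the first bytes of the file. */
    public static String sniff(String path) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
        }
        if (startsWith(header, read, ZIP_MAGIC)) {
            return "ZIP";
        }
        if (startsWith(header, read, OLE2_MAGIC)) {
            return "OLE2";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, int length, byte[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WordFormatSniffer path/to/file
        System.out.println(sniff(args[0]));
    }
}
```

A plain .txt file would normally come back as "UNKNOWN", which is why reading it line by line works while the other two print binary noise.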

Community
vkrams
2

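You need the following 6 jars on the classpath (ideally with all POI jars from the same release, since Apache POI does not support mixing versions):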

  1. xmlbeans-2.3.0.jar
  2. dom4j-1.6.1.jar
  3. poi-ooxml-3.8-20120326.jar
  4. poi-ooxml-schemas-3.8-20120326.jar
  5. poi-scratchpad-3.2-FINAL.jar
  6. poi-3.5-FINAL.jar

Code:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class test {
    public static void readDocxFile(String fileName) {
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream(fileName)) {
            XWPFDocument document = new XWPFDocument(fis);
            List<XWPFParagraph> paragraphs = document.getParagraphs();

            // Print each body paragraph in document order
            for (XWPFParagraph paragraph : paragraphs) {
                System.out.println(paragraph.getParagraphText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        readDocxFile("C:\\Users\\sp0c43734\\Desktop\\SwatiPisal.docx");
    }
}
Swati Pisal