
On September 28, 2009, the Apache POI project released version 3.5, which officially supports the OOXML formats introduced in Office 2007, such as DOCX and XLSX.

Please provide a code sample for extracting a DOCX file's content in plain text, ignoring any styles or formatting.

I am asking this because I have been unable to find any Apache POI examples covering the new OOXML support.

Todd Main
Robert Campbell

2 Answers

21

This worked for me. Make sure you add the required JARs to the classpath (poi-ooxml and its dependencies; you may need to upgrade xmlbeans, etc.).

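import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public String extractText(InputStream in) throws Exception {
    // Parse the DOCX package, then pull out the text with all formatting dropped
    XWPFDocument doc = new XWPFDocument(in);
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    return ex.getText();
}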
7

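This is more generic. ExtractorFactory picks an extractor based on the file type, so the same code also handles the other formats POI supports, not just DOCX:

// in is an InputStream over the document
POITextExtractor poitex = ExtractorFactory.createExtractor(in);
return poitex.getText();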
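
For anyone curious what "plain text, ignoring styles" amounts to, or who cannot add the POI jars: a DOCX file is a ZIP package whose main body lives in word/document.xml, with paragraphs as w:p elements and the visible text in w:t elements. Below is a rough JDK-only sketch of that idea (Java 9+ for readAllBytes). It is not how POI implements its extractors, and it only reads the main body: headers, footers, footnotes, tabs and line breaks are skipped, and text-box content may be repeated. The class name DocxPlainText and the tinyDocx helper are made up for illustration; tinyDocx builds just enough of a package to exercise the code and is not a file Word would open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxPlainText {
    // WordprocessingML namespace used by word/document.xml
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the main body text of a DOCX stream, one line per paragraph. */
    public static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                // Buffer the entry so the XML parser never touches the zip stream
                byte[] xml = zip.readAllBytes();
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations to avoid XXE on untrusted files
                factory.setFeature(
                        "http://apache.org/xml/features/disallow-doctype-decl", true);
                Document dom = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml));

                StringBuilder out = new StringBuilder();
                NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element p = (Element) paragraphs.item(i);
                    // Concatenate every text run inside this paragraph
                    NodeList texts = p.getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < texts.getLength(); j++) {
                        out.append(texts.item(j).getTextContent());
                    }
                    out.append('\n');
                }
                return out.toString();
            }
        }
        throw new IllegalArgumentException("no word/document.xml entry found");
    }

    /** Hypothetical helper: a minimal in-memory package with bold paragraphs. */
    public static byte[] tinyDocx(String... paragraphs) throws Exception {
        StringBuilder xml = new StringBuilder(
                "<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String p : paragraphs) {
            xml.append("<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>")
               .append(p)
               .append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = tinyDocx("Hello", "World");
        System.out.print(extractText(new ByteArrayInputStream(docx)));
    }
}
```

For real documents the POI extractors above are still the better choice, since they also deal with the parts this sketch ignores.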