I currently have some code that converts a .doc document to HTML, but the code I am using to convert a .docx to HTML unfortunately doesn't pick up the text and convert it. Below is my code.

private void convertWordDocXtoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    XWPFDocument wordDocument = null;
    try {
        wordDocument = new XWPFDocument(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
    }

    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();

    String result = new String(out.toByteArray());
    acDocTextArea.setText(newDocText);
    String htmlText = result;

}

Any ideas as to why this isn't working would be much appreciated. The ByteArrayOutputStream should contain the entire HTML, but it is empty and has no text.

yams

2 Answers

Mark, you're using the HWPF package (WordToHtmlConverter), which supports only the .doc format; see this description. That document also mentions an effort to provide an interface for .docx files through the XWPF package. However, the project seems to lack contributors, and users are encouraged to submit extensions. Limited functionality should be available, though, and text extraction should be part of it.

You should also see this question: How to Extract docx (word 2007 above) using apache POI.
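
Since the two formats need different POI classes, it can help to check which one a file really is before choosing a converter. The sketch below uses only the JDK and relies on the well-known file signatures: a legacy .doc is an OLE2 compound file, while a .docx is a ZIP package. The class and method names (WordFormatSniffer, sniff) are made up for illustration and are not part of Apache POI.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Apache POI: guesses the container format
// of a Word file from its first bytes.
public class WordFormatSniffer {

    public enum WordFormat { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (legacy .doc).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header (.docx is a ZIP package).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a buffer holding the leading bytes of the file.
    public static WordFormat sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return WordFormat.DOC_OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return WordFormat.DOCX_ZIP;
        }
        return WordFormat.UNKNOWN;
    }

    // Reads up to 8 bytes from the stream and classifies them.
    public static WordFormat sniff(InputStream in) throws IOException {
        byte[] buffer = new byte[8];
        int filled = 0;
        while (filled < buffer.length) {
            int n = in.read(buffer, filled, buffer.length - filled);
            if (n < 0) {
                break; // end of stream before 8 bytes were available
            }
            filled += n;
        }
        byte[] header = new byte[filled];
        System.arraycopy(buffer, 0, header, 0, filled);
        return sniff(header);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A DOC_OLE2 result would point to the HWPF classes (HWPFDocument with WordToHtmlConverter), and DOCX_ZIP to the XWPF classes (XWPFDocument). Note that any ZIP archive matches the second signature, so this only rules formats out; it does not prove the file is a valid Word document.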

Jarekczek
  • I am using XWPFDocument to get the text from the .docx, which works great, but I also need to convert the original .docx file to an HTML file. I can get the text of the file; what I cannot get is the HTML version. When I use the word extractor, I get the text from the .docx. I just can't convert the file to HTML for some reason, and no errors are given. – yams Oct 28 '12 at 16:27
  • You can also have a look at my example in here http://stackoverflow.com/questions/24652953/convert-docx-to-html-using-java – Vignesh Paramasivam Aug 22 '16 at 06:38

I was stuck at this point too.
I have since found a third-party API that converts .docx to HTML, and it works fine:
https://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML