0

I need to convert the content of a .docx file to HTML so that I can display it in a web UI.

I've used Apache POI's XWPFDocument class, but I haven't been able to get any results yet: the output is always an empty string. My code is based on this sample.

Here is my code:

    public JSONObject uploadDocxFile(MultipartFile multipartFile) throws Exception {
        InputStream inputStream = multipartFile.getInputStream();
        XWPFDocument wordDocument = new XWPFDocument(inputStream);

        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
        org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DOMSource domSource = new DOMSource(htmlDocument);
        StringWriter stringWriter = new StringWriter();

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer serializer = tf.newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(domSource, new StreamResult(stringWriter));
        out.close();

        String result = new String(out.toByteArray());
        String htmlText = result;

        JSONObject jsonObject = new JSONObject();
        jsonObject.put("content", htmlText);
        jsonObject.put("success", true);
        return jsonObject;
    }
talha06
  • possible duplicate of [Converting a .docx to html using Apache POI and getting no text](http://stackoverflow.com/questions/13103421/converting-a-docx-to-html-using-apache-poi-and-getting-no-text) – Robert Longson Jan 22 '13 at 14:49
  • There's no proper answer there. The owner of that question opened it for the same reason as me, but he added a comment saying that he has no problem getting the text. – talha06 Jan 22 '13 at 19:23

3 Answers

1

Even though it's late, I think the code in the question can be modified as follows (this works with Word 97 documents):

    // Requires (among others): java.io.*, java.nio.charset.StandardCharsets,
    // org.apache.poi.hwpf.HWPFDocument, org.apache.poi.hwpf.converter.WordToHtmlConverter,
    // org.apache.poi.poifs.filesystem.POIFSFileSystem
    private static void convertWordDoc2HTML(File file)
            throws ParserConfigurationException, TransformerException, IOException {
        // Change the type from XWPFDocument to HWPFDocument (binary .doc format).
        // try-with-resources closes the stream; an IOException now propagates
        // instead of leaving hwpfDocument null.
        HWPFDocument hwpfDocument;
        try (FileInputStream fis = new FileInputStream(file)) {
            hwpfDocument = new HWPFDocument(new POIFSFileSystem(fis));
        }

        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
        // Add the processDocument call; without it the HTML document stays empty.
        wordToHtmlConverter.processDocument(hwpfDocument);
        org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DOMSource domSource = new DOMSource(htmlDocument);
        StreamResult streamResult = new StreamResult(out);

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer serializer = tf.newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(domSource, streamResult);

        // Decode with the same charset the serializer was told to use.
        String htmlText = new String(out.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(htmlText);
    }

I hope this is useful.

0

I am using docx4j to do this, and it seems to work. If you're using Maven, you can add docx4j as a dependency (use version 3.0.0) and then start from one of the docx4j sample programs, ConvertOutHtml.java. Change the file path in ConvertOutHtml.java to point to your file and you should be fine.

Jason Pather
0

Your code generates empty HTML output because you never pass a document to the converter for processing.

In any case, if the file is a .docx, you should use XHTMLConverter to convert it to HTML instead of WordToHtmlConverter. See this answer
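There also appears to be a second reason the question's code returns an empty string: it serializes into a StringWriter but then builds the result from a ByteArrayOutputStream that nothing ever writes to. The sketch below uses only JDK classes to illustrate the serialization step in isolation. The class name SerializeDomDemo, the helper methods, and the hand-built sample DOM are assumptions made for illustration; in the real code the DOM would come from the converter after it has processed the document.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeDomDemo {

    // Hypothetical stand-in for the DOM that a converter would have filled in.
    public static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("Hello");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    // Serializes the DOM as HTML and reads the result from the same
    // writer that the transformer wrote into.
    public static String serialize(Document doc) throws Exception {
        StringWriter writer = new StringWriter();
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // A stream that the transformer never receives stays empty,
        // which mirrors what happens to 'out' in the question.
        ByteArrayOutputStream neverWritten = new ByteArrayOutputStream();
        String fromUnusedStream = new String(neverWritten.toByteArray());
        System.out.println("unused stream is empty: " + fromUnusedStream.isEmpty());

        // Reading from the writer that was actually passed to transform(...)
        // yields the serialized markup.
        System.out.println(serialize(sampleDocument()));
    }
}
```

In the question's method, the equivalent change would be to return stringWriter.toString() (or to pass a StreamResult wrapping out, as the first answer does) once the converter has actually processed the document.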

Guga