
I tried converting .doc to HTML using WordToHtmlConverter, and it worked perfectly.

But when I tried to convert .docx to HTML, I got stuck.

What I tried:

I used the code below to convert .docx to HTML. It is taken from: How to use Tika's XWPFWordExtractorDecorator class?

        InputStream input = TikaInputStream.get(new File("C:\\Users\\Downloads\\filename.docx"));


        Parser parser = new AutoDetectParser();


        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory = (SAXTransformerFactory)
                 SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        handler.setResult(new StreamResult(sw));


        try {
            Metadata metadata = new Metadata();
            parser.parse(input, handler, metadata, new ParseContext());
            String xml = sw.toString();
            System.out.print("tika : "+xml); 
        } finally {
            input.close();
        }

The output I got is:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body/>
</html>
  • Please explain where I went wrong.
  • Is there a better way to convert .docx to an HTML string?

I appreciate your help. Thanks.

Vignesh Paramasivam
  • According to the documentation https://poi.apache.org/apidocs/org/apache/poi/hwpf/converter/WordToHtmlConverter.html this API is meant to be used up to Word 2007, when there was only .doc. So it won't work for .docx with this API. Try to save your document as .doc – singe3 Jul 09 '14 at 11:57
  • @singe31 you didn't get my point. I have converted .doc to HTML by using the hwpf converter. But I'm trying to do it for .docx, is there a way? – Vignesh Paramasivam Jul 09 '14 at 12:02
  • https://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML – singe3 Jul 09 '14 at 12:05
  • At their simplest `.docx` files are an archive (you can open them with something like 7zip and view the contents) containing a bunch of XML files. With that in mind, you'd want to use something that can transform the XML into HTML. – JonK Jul 09 '14 at 12:08
  • You could also take a look at [Pandoc](http://johnmacfarlane.net/pandoc/) or call any other command-line tool from Java. These tasks are not that trivial and I'm not sure if there's a working API out there for that other than POI ATM. – rlegendi Jul 09 '14 at 12:24
  • I figured it out by using the link: code.google.com/p/xdocreport/wiki/XWPFConverterXHTML. I'll just post it as an answer, it might help someone. Thank you all for your suggestions. – Vignesh Paramasivam Jul 09 '14 at 12:33
  • You can use docx4j for that, see the example: https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/ConvertOutHtml.java – Alexander Davliatov Apr 11 '17 at 23:11
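
To illustrate the comment above that a `.docx` file is an archive of XML parts, here is a minimal sketch that lists the entries of such a container using only `java.util.zip`. The class name `DocxAsZip`, the helper names and the in-memory sample archive are made up for this demo; the sample is a stand-in, not a valid Word file. For a real document you would pass a `FileInputStream` for your own file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxAsZip {

    /** Lists the entry names of a zip-based container such as a .docx file. */
    static List<String> listEntries(InputStream source) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(source)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Builds a tiny stand-in archive in memory (hypothetical content, not a valid Word file). */
    static byte[] buildSampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass a FileInputStream for your .docx instead of the sample.
        for (String name : listEntries(new ByteArrayInputStream(buildSampleArchive()))) {
            System.out.println(name);
        }
    }
}
```

In a real `.docx` the main body text lives in `word/document.xml`, which is the part a converter has to transform into HTML.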

2 Answers


This code worked for me to convert .docx to HTML:

You can also look at the link: Link to code

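        // Convert .docx to an HTML string.
        // Needs Apache POI plus the xdocreport converter jars
        // (org.apache.poi.xwpf.converter.core and org.apache.poi.xwpf.converter.xhtml).
        InputStream in = new FileInputStream(new File(path));
        XWPFDocument document = new XWPFDocument(in);

        // Resolve image URIs in the generated markup against "word/media".
        XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File("word/media")));

        OutputStream out = new ByteArrayOutputStream();

        XHTMLConverter.getInstance().convert(document, out, options);
        String html = out.toString();
        System.out.println(html);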
Vignesh Paramasivam
  • Could anyone provide an updated example please? The reference does not work anymore. Thanks. – Andres Dec 15 '16 at 19:16
  • I had trouble using this code because I could not find the jars for XHTMLOptions, XHTMLConverter and FileURIResolver. After searching, I found them in "org.apache.poi.xwpf.converter.core-1.0.6.jar", "org.apache.poi.xwpf.converter.xhtml-1.0.6.jar" and "ooxml-schemas-1.1.jar". With these jars on the classpath, the above code runs without errors. – Vipul Jain Jan 17 '17 at 13:12
  • @Vipul here is the dependency for it: https://mvnrepository.com/artifact/fr.opensagres.xdocreport/org.apache.poi.xwpf.converter.xhtml/1.0.6 – DanteVoronoi Jul 10 '17 at 08:15
  • Thanks all for saving my time. – Vishal Zanzrukia Jan 31 '18 at 13:32
  • I followed the above code and it converts docx to HTML, but the border styles applied in my docx are missing. Any idea? – Jay Feb 23 '18 at 06:52
  • Images are not working in my case. Is there a fix for it? – David Christie Aug 28 '20 at 08:21
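
One detail worth noting about the `out.toString()` call in the answer above: `ByteArrayOutputStream.toString()` with no arguments decodes the bytes with the platform default charset, so non-ASCII characters can come out garbled on some machines. The sketch below decodes explicitly instead. It assumes the converter writes UTF-8, which is an assumption you should check against the encoding declared in the generated markup; the class name `DecodeHtmlBytes` and the sample markup are made up for the demo.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class DecodeHtmlBytes {

    /** Decodes the captured bytes as UTF-8, independent of the platform default charset. */
    static String decodeUtf8(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Stand-in for converter output; this markup is invented for the demo.
        out.write("<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeUtf8(out));
    }
}
```

Declaring the variable as `ByteArrayOutputStream` rather than `OutputStream` is what makes `toByteArray()` available.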

You may want to use the Mammoth docx-to-HTML library. It converts .docx documents to HTML, and it can run in the browser as well as on the backend.