Converting .doc to .html in Java using Apache POI

Question

I want to convert a document .doc that contains some images. How to convert it to *.html, so that the images will remain same position? How to store those images in separate folder named image and use this folder as a source for image?

My code:

import java.io.BufferedWriter;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.w3c.dom.Document;

public class TestWordToHtmlConverter {
    private File docFile;
    private File file;

    public TestWordToHtmlConverter(File docFile) {
        this.docFile = docFile;
    }

    public void convert(File file) {
    this.file = file;

        try {
            FileInputStream finStream=new FileInputStream(docFile.getAbsolutePath()); 
            HWPFDocument doc=new HWPFDocument(finStream);
            WordExtractor wordExtract=new WordExtractor(doc);
            Document newDocument = DocumentBuilderFactory.newInstance() .newDocumentBuilder().newDocument();
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) ;
            wordToHtmlConverter.processDocument(doc);

            StringWriter stringWriter = new StringWriter();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();

            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(new DOMSource( wordToHtmlConverter.getDocument()), new StreamResult( stringWriter ) );

            String html = stringWriter.toString();
            FileOutputStream fos=new FileOutputStream(new File("html/sample.html"));
            DataOutputStream dos;

            try {
                BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fos,"UTF-8"));    
                out.write(html);
                out.close();
            }
            catch (IOException e) {
               e.printStackTrace();
            }

           /*JEditorPane editorPane = new JEditorPane();
           editorPane.setContentType("text/html");
           editorPane.setEditable(false);

           editorPane.setPage(file.toURI().toURL());

           JScrollPane scrollPane = new JScrollPane(editorPane);     
           JFrame f = new JFrame("Display Html File");
           f.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
           f.getContentPane().add(scrollPane);
           f.setSize(512, 342);
           f.setVisible(true);*/

        } catch(Exception e) {
            e.printStackTrace();
        }
    }  

    public static void main(String args[]) {
        TestWordToHtmlConverter TTC=new TestWordToHtmlConverter(new File("docx/sample.doc"));
        TTC.convert(TTC.docFile);         
   }
}

This implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method

score 4 · Answer 1 · edited May 23 '17 at 11:52

4

As said in API docs:

WordToHtmlConverter doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.

How to override method you can found here:

I need Apache POI Pictures converted from a word document to a html file

You can try using DOCX 2 XHTML converter based on Apache POI XWPF:

XWPFDocument 2 XHTML

Also you can use Apache Tika, built on top of Apache POI. An example that included in Alfresco can be found here:

HTMLRenderingEngine

There are also many other converters.

edited May 23 '17 at 11:52

Community

1
1

answered Apr 27 '14 at 14:58

Thanks ...now i got solution – sudhakar810 Apr 29 '14 at 18:22
now i got solution for images and its working correctly. But there is a problem related to bullets and Numbering.Those paragraphs that contains list visible correctly.Numbering is missing. – sudhakar810 May 08 '14 at 06:59
1

The API Doc is wrong. They probably forgot to remove that comment when they implemented the image processing feature. It already does implement the link creation, you just need to handle the image storage (by implementing the PicturesManager interface and setting it to the converter). – Guga Jan 26 '16 at 13:47

Converting .doc to .html in Java using Apache POI

1 Answers1