4

There is a DOC file that contains some image. How to convert it to HTML with image?

I tried to use this example: Convert Word doc to HTML programmatically in Java

public class Converter {
    ...

    private File docFile, htmlFile;

    try {
        FileInputStream fos = new FileInputStream(docFile.getAbsolutePath()); 
        HWPFDocument doc = new HWPFDocument(fos);       
        Document newDoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDoc) ;
        wordToHtmlConverter.processDocument(doc);

        StringWriter stringWriter = new StringWriter();

        Transformer transformer = TransformerFactory.newInstance().newTransformer();        
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(
                    new DOMSource(wordToHtmlConverter.getDocument()),
                    new StreamResult(stringWriter)
        );

        String html = stringWriter.toString();

        try {
            BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(htmlFile), "UTF-8")
            );     
            out.write(html);
            out.close();
       } catch (IOException e) {
           e.printStackTrace();
       }

       JEditorPane jEditorPane = new JEditorPane();
       jEditorPane.setContentType("text/html");
       jEditorPane.setEditable(false);
       jEditorPane.setPage(htmlFile.toURI().toURL());

       JScrollPane jScrollPane = new JScrollPane(jEditorPane);

       JFrame jFrame = new JFrame("display html file");
       jFrame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
       jFrame.getContentPane().add(jScrollPane);
       jFrame.setSize(512, 342);
       jFrame.setVisible(true);

    } catch(Exception e) {
        e.printStackTrace();
    }
    ...
}

But the image is lost.

The documentation for the WordToHtmlConverter class says the following:

...this implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.

How to convert DOC to HTML with images?

tripleee
  • 175,061
  • 34
  • 275
  • 318

2 Answers2

5

Extend WordToHtmlConverter and override processImageWithoutPicturesManager.

 import java.util.Base64;

import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
public class InlineImageWordToHtmlConverter extends WordToHtmlConverter {

    public InlineImageWordToHtmlConverter(Document document) {
        super(document);
    }

    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock,
        boolean inlined, Picture picture)
    {
        Element imgNode = currentBlock.getOwnerDocument().createElement("img");
        StringBuilder sb = new StringBuilder();
        sb.append(Base64.getMimeEncoder().encodeToString(picture.getRawContent()));
        sb.insert(0, "data:"+picture.getMimeType()+";base64,");
        imgNode.setAttribute("src", sb.toString());
        currentBlock.appendChild(imgNode);
    }

}

Use the new class while parsing document as shown below

HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream("D:/temp/Temp.doc"));    
        WordToHtmlConverter wordToHtmlConverter = new InlineImageWordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .newDocument());
        wordToHtmlConverter.processDocument(wordDocument);
raok1997
  • 369
  • 5
  • 12
2

Your best bet in this case is to use Apache Tika, and let it wrap Apache POI for you. Apache Tika will generate HTML for your document (or plain text, but you want the HTML for your case). Along with that, it'll put in placeholders for embedded resources, img tags for embedded images, and provide you with a way to get at the contents of the embedded resources and images.

There's a very good example of doing this included in Alfresco, HTMLRenderingEngine. You'll likely want to review the code there, then write your own to do something very similar. The code there includes a custom ContentHandler which allows editing of the img tags, to re-write the src attributes, you may or may not need that depending on where you're going to write out the images to.

Gagravarr
  • 47,320
  • 10
  • 111
  • 156