I need Apache POI Pictures converted from a word document to a html file

Question

I have some code that uses the Java Apache POI library to open a Microsoft Word document and convert it to HTML. It also gets the byte array data of the images in the document, but I need to get those images into the HTML that is written out to an HTML file. Any hints or suggestions would be appreciated. Keep in mind that I am a desktop developer and not a web programmer, so please bear that in mind when making suggestions. The code below gets the images.

private void parseWordText(File file) throws IOException {
    // doc and picList are fields of the enclosing class.
    FileInputStream fs = new FileInputStream(file);
    doc = new HWPFDocument(fs);
    PicturesTable picTable = doc.getPicturesTable();
    if (picTable != null) {
        picList = new ArrayList<Picture>(picTable.getAllPictures());
        if (!picList.isEmpty()) {
            for (Picture pic : picList) {
                // Raw image bytes of the embedded picture.
                byte[] byteArray = pic.getContent();
                // The return values below are not used yet.
                pic.suggestFileExtension();
                pic.suggestFullFileName();
                pic.suggestPictureType();
                pic.getStartOffset();
            }
        }
    }
}

The code below then converts the document to HTML. Is there a way to add the byteArray to the ByteArrayOutputStream in that code?

private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    HWPFDocumentCore wordDocument = null;
    try {
        wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
    }

    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    NamedNodeMap node = htmlDocument.getAttributes();


    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();

    // Decode with the same encoding the serializer was configured to write.
    String result = new String(out.toByteArray(), "UTF-8");
    acDocTextArea.setText(newDocText);

    htmlText = result;

}

Have a look at this example, it uses POIs WordToHtmlConverter: http://stackoverflow.com/questions/7868713/convert-word-to-html-with-apache-poi — Udo Klimaschewski, Oct 30 '12 at 15:44
I already have that part of the code working, I am asking about how to get the pics into the html. You know the Picture list I created above. — yams, Oct 30 '12 at 16:52
So you mean, you want to code the picture directly into the HTML markup for your document, without doing an `<img src="...">` reference? There's a data URI that works on most modern browsers, e.g. `<img src="data:image/png;base64,...">`. See http://en.wikipedia.org/wiki/Data_URI_scheme. — Udo Klimaschewski, Oct 30 '12 at 18:09
So Udo, I am not a web developer so how would I do either? and which would be easier to implement? — yams, Oct 30 '12 at 18:18
That depends on your needs. The reference solution requires a separate file for each image; the inline solution keeps all images and the HTML in one file. To implement the first, you would simply save each image to a file and reference its location in the `img` tag. For the second, you would have to convert the image to a Base64 string first and embed it directly in the `img` tag. You can search Stack Overflow and the web for more detail on how to do both. — Udo Klimaschewski, Oct 30 '12 at 18:52
Sounds like converting to a base64 string would be best. Thank you Udo you have been extremely helpful. — yams, Oct 30 '12 at 19:03
One more question how can I add the byte[] to my existing code that converts the document to a html? — yams, Oct 30 '12 at 19:31
Have you looked at using Apache Tika? That already provides a way to wrap up Apache POI, and output a HTML version along with any embedded resources (eg images), so you can avoid reinventing the wheel! — Gagravarr, Oct 30 '12 at 23:25
The OpenOffice converter JODConverter is also worth a try, I think: http://www.artofsolving.com/opensource/jodconverter — Udo Klimaschewski, Oct 31 '12 at 12:19
Udo I am unfortunately at a point where I need to use the Apache POI and continue on with what I have. — yams, Oct 31 '12 at 14:17
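
The inline (Base64) approach discussed in the comments above can be sketched with the JDK alone. This is a minimal illustration, not part of the original thread: the class and method names (`DataUriSketch`, `toDataUri`, `toImgTag`) are hypothetical, and the MIME type is assumed to be supplied by the caller (for example, whatever POI reports for the picture). It requires Java 8 or later for `java.util.Base64`.

```java
import java.util.Base64;

final class DataUriSketch {

    // Builds a data URI of the form "data:<mimeType>;base64,<encoded bytes>".
    // The MIME type is an assumption supplied by the caller, e.g. "image/png".
    static String toDataUri(byte[] imageBytes, String mimeType) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "data:" + mimeType + ";base64," + base64;
    }

    // Wraps the data URI in an img tag so the image travels inside the HTML file.
    static String toImgTag(byte[] imageBytes, String mimeType) {
        return "<img src=\"" + toDataUri(imageBytes, mimeType) + "\">";
    }

    public static void main(String[] args) {
        byte[] placeholder = {1, 2, 3}; // placeholder bytes, not a real image
        System.out.println(toImgTag(placeholder, "image/png"));
    }
}
```

In the question's code, the `byteArray` obtained from `pic.getContent()` would be the input here. The trade-off is the one described above: a single self-contained HTML file, at the cost of a larger file because Base64 inflates the image data by roughly a third.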

score 3 · Accepted Answer · answered Oct 31 '12 at 16:18

Looking at the source code for the org.apache.poi.hwpf.converter.WordToHtmlConverter at

http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740

It states in the JavaDoc:

This implementation doesn't create images or links to them. This can be changed by overriding {@link #processImage(Element, boolean, Picture)} method

If you take a look at that processImage(...) method in AbstractWordConverter.java at line 790, it looks like the method in turn calls another method named processImageWithoutPicturesManager(...).

http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740

This method is overridden in WordToHtmlConverter and looks very much like the place where you want to add your code (line 317):

@Override
protected void processImageWithoutPicturesManager(Element currentBlock,
    boolean inlined, Picture picture)
{
    // no default implementation -- skip
    currentBlock.appendChild(htmlDocumentFacade.document
    .createComment("Image link to '"
    + picture.suggestFullFileName() + "' can be here"));
}

I think this is the point where you can start inserting the images into the output.

Create a subclass of the converter, e.g.

    public class InlineImageWordToHtmlConverter extends WordToHtmlConverter

and then override the method and put your own code into it.

I haven't tested it, but from what I can see it should be the right approach in theory.
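
As a rough idea of what the body of such an override might do, here is a sketch that uses only the JDK's DOM API, so it can be tried without POI on the classpath. The helper name `appendInlineImage` and the class `InlineImageDomSketch` are hypothetical; inside a real override, `currentBlock` would be the element POI passes in, and the bytes and MIME type would come from the `Picture` argument.

```java
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class InlineImageDomSketch {

    // Creates an <img> whose src is a Base64 data URI and appends it to currentBlock.
    // content and mimeType are assumed to come from the picture being processed.
    static Element appendInlineImage(Element currentBlock, byte[] content, String mimeType) {
        Document owner = currentBlock.getOwnerDocument();
        Element img = owner.createElement("img");
        String base64 = Base64.getEncoder().encodeToString(content);
        img.setAttribute("src", "data:" + mimeType + ";base64," + base64);
        currentBlock.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element paragraph = doc.createElement("p");
        doc.appendChild(paragraph);
        // Placeholder bytes stand in for real picture content.
        Element img = appendInlineImage(paragraph, new byte[] {1, 2, 3}, "image/png");
        System.out.println(img.getAttribute("src"));
    }
}
```

Using the owner document of `currentBlock` keeps the new element in the same DOM that the converter later serializes.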

Actually you don't need to override any method in the WordToHtmlConverter pointed to by the links in the answer. The implementation already covers link creation. You just need to implement the PicturesManager interface (to save the pictures) and set it as the pictures manager of the converter. — Guga, Jan 26 '16 at 13:37

score 1 · Answer 2 · answered Sep 07 '17 at 12:18

@user4887078 It's straightforward, just as @Guga said. All I did was look at org.apache.poi.xwpf.converter.core.FileImageExtractor, and voilà! It works as expected, although it might still need some refactoring and optimization.

HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(is);

            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());
            wordToHtmlConverter.setPicturesManager(new PicturesManager() {
                @Override
                public String savePicture(byte[] bytes, PictureType pictureType,
                        String suggestedName, float widthInches, float heightInches) {
                    File imageFile = new File("pages/imgs", suggestedName);
                    imageFile.getParentFile().mkdirs();
                    InputStream in = null;
                    FileOutputStream out = null;

                    try {
                        in = new ByteArrayInputStream(bytes);
                        out = new FileOutputStream(imageFile);
                        IOUtils.copy(in, out);

                    } catch (FileNotFoundException e) {
                        e.printStackTrace();
                    } catch (IOException e) {
                        e.printStackTrace();
                    } finally {
                        if (in != null) {
                            IOUtils.closeQuietly(in);
                        }

                        if (out != null) {
                            IOUtils.closeQuietly(out);
                        }

                    }
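                    // The converter appears to use the returned string as the image URL in the
                    // generated HTML, so it should be relative to where the HTML file is written.
                    return "imgs/" + imageFile.getName();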
                }
            });
            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(out);


            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(domSource, streamResult);
            out.close();

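            String result = new String(out.toByteArray(), "UTF-8");
            // Write the serialized UTF-8 bytes to the output file.
            FileOutputStream fos = new FileOutputStream(outFile);
            fos.write(out.toByteArray());
            fos.close();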
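
The serialization step at the end of this snippet can also stream straight into the output file instead of going through a ByteArrayOutputStream and a String. The following is a minimal sketch using only the JDK; `HtmlFileWriterSketch`, `writeHtml` and `sampleDocument` are hypothetical names, and the hand-built DOM merely stands in for the document returned by `wordToHtmlConverter.getDocument()`.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class HtmlFileWriterSketch {

    // Serializes the DOM as HTML directly into outFile, encoded as UTF-8.
    static void writeHtml(Document htmlDocument, Path outFile) throws IOException {
        try (OutputStream out = Files.newOutputStream(outFile)) {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(new DOMSource(htmlDocument), new StreamResult(out));
        } catch (TransformerException e) {
            throw new IOException("Could not serialize the HTML document", e);
        }
    }

    // Stand-in for the converter's output: a tiny DOM with one image reference.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element img = doc.createElement("img");
            img.setAttribute("src", "imgs/picture1.png"); // hypothetical relative path
            body.appendChild(img);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path outFile = Files.createTempFile("converted", ".html");
        writeHtml(sampleDocument(), outFile);
        System.out.println("Wrote " + outFile);
    }
}
```

Writing through the stream avoids decoding the bytes with the platform default charset, which is what `new String(byte[])` does when no charset is given.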

I tried this approach too. I can save the image from the word document in the pages/imgs directory. What approach should I follow to include it in the converted html which is generated? I'm stuck here. Any help would be greatly appreciated. — Nitin Avula, Dec 14 '18 at 17:00
@NitinAvula I hope you've got this sorted out, as I'm just seeing your comment now... It's been a while since I looked at Java, but I'll be glad to share the tiny multidoc viewer I built with this at the time... Or better, provide a fiddle or gist so I can determine how best to help out. — Enrico, May 21 '20 at 16:17
thanks for the reply. It was a problem in my old project. I don't remember if I solved it or found a way around it. Thanks for the help though :) — Nitin Avula, May 21 '20 at 18:02

score 0 · Answer 3 · edited Dec 22 '17 at 02:00

Use this; it should be useful.

public class InlineImageWordToHtmlConverter extends WordToHtmlConverter{
    public InlineImageWordToHtmlConverter(Document document) {
        super(document);
    } 

    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
        Element img = super.getDocument().createElement("img");
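        // Use the picture's own MIME type rather than assuming every image is a PNG.
        img.setAttribute("src", "data:" + picture.getMimeType() + ";base64," + Base64.getEncoder().encodeToString(picture.getContent()));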
        currentBlock.appendChild(img);
    }
}

answered Dec 22 '17 at 01:42 by wenshenjun · edited Dec 22 '17 at 02:00 by Stephen Rauch

I tried it. I extended the WordToHtmlConverter and overrode the processImageWithoutPicturesManager() method. But the problem I found while debugging is that the constructor of InlineImageWordToHtmlConverter() is being called, but the overridden processImageWithoutPicturesManager() method is not. Hence, the image from the Word document is not being converted to HTML. Please advise how to fix this. – Nitin Avula Dec 14 '18 at 16:59
