Apache-Tika & Java: parsing documents (.docx, .pdf) INCLUDING images, and use images in generated .xhtml file

Question

I need to parse various document formats (e.g. .docx, .pdf) and convert their content (including images) to an .xhtml file. I'm using Apache Tika 1.17 (as a Maven dependency) in a Java project.

I've looked at several existing questions about this (one, another), and by using a custom EmbeddedDocumentExtractor I was able to extract the embedded .png images and save them alongside the generated .xhtml file.

The problem is that in both cases (.docx and .pdf input files), the generated .xhtml file does not refer to the images simply by file name. Instead, the src attribute carries an embedded: prefix, like this:

<img src="embedded:image5.png" alt="image0.png" />.

So only the content of the alt attribute is displayed, not the image itself.

Can I change or configure this somehow?

Would it be possible to include the images inside the .xhtml file itself, as binary data?

Or what other options do I have for working around this problem?

Thank you.

Write your own custom Content Handler that re-writes the src to refer to wherever you've chosen to save the embedded images to? — Gagravarr, Jan 24 '18 at 18:19
Thanks, that's the approach that I chose in the end... Would you know how such a custom ContentHandler would have to be coded, to somehow "intercept" exactly the _ — Serban, Jan 25 '18 at 14:06
Take a look at `TikaImageRewritingContentHandler` from Alfresco for one example - that re-writes the embedded image links to be absolute for images stored in a certain path, I think that's basically what you need: https://github.com/Alfresco/community-edition-old/blob/master/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java — Gagravarr, Jan 25 '18 at 14:15
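
A minimal sketch of the approach suggested in the comments above, assuming the parser emits `<img src="embedded:NAME" />` as shown in the question. It is not the Alfresco implementation. The class name `EmbeddedImageSrcRewriter` and the `resolver` parameter are hypothetical names introduced for illustration. The class relies only on the JDK's SAX types, so it wraps whatever `ContentHandler` currently produces the .xhtml output.

```java
import java.util.function.UnaryOperator;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical helper: forwards all SAX events to a downstream handler,
 * but rewrites the src attribute of img elements that use the
 * "embedded:" prefix.
 */
class EmbeddedImageSrcRewriter extends XMLFilterImpl {

    private static final String PREFIX = "embedded:";

    // Maps an embedded resource name (e.g. "image5.png") to the value
    // that should end up in the src attribute.
    private final UnaryOperator<String> resolver;

    EmbeddedImageSrcRewriter(ContentHandler downstream, UnaryOperator<String> resolver) {
        setContentHandler(downstream);
        this.resolver = resolver;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName) || "img".equals(qName)) {
            // Copy the attributes, since the incoming object may be read-only or reused.
            AttributesImpl copy = new AttributesImpl(atts);
            for (int i = 0; i < copy.getLength(); i++) {
                boolean isSrc = "src".equals(copy.getLocalName(i)) || "src".equals(copy.getQName(i));
                String value = copy.getValue(i);
                if (isSrc && value != null && value.startsWith(PREFIX)) {
                    copy.setValue(i, resolver.apply(value.substring(PREFIX.length())));
                }
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }
}
```

With Tika, the rewriter would presumably be passed to `parser.parse(...)` in place of the original handler, for example `new EmbeddedImageSrcRewriter(new ToXMLContentHandler(), name -> "images/" + name)`, if the custom EmbeddedDocumentExtractor saves the files into an `images/` directory next to the .xhtml file (that directory name is an assumption). To inline the images as binary data instead, the resolver could return a `data:image/png;base64,...` URI built with `java.util.Base64` from the saved bytes. This only works if each image has already been extracted by the time its `<img>` element is emitted, which I have not verified for every parser.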
