I need to parse various document formats (e.g. .docx, .pdf) and convert their content (including images) to an .xhtml file. I'm using Apache Tika 1.17 (as a Maven dependency) in a Java project.
I've looked at several existing questions about this (one, another), and by using a custom EmbeddedDocumentExtractor I was able to extract the included .png images alongside the generated .xhtml file.
The problem is that in both cases (.docx and .pdf input files), the generated .xhtml file does not refer to the images simply by their file name, but uses this kind of syntax instead:
<img src="embedded:image5.png" alt="image0.png" />
So only the content of the alt attribute is displayed, not the image itself.
Could I somehow change or configure this?
Would it be possible to include the images inside the .xhtml file itself, as binary data?
Or what other options do I have to work around this problem?
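To make the "binary data" idea concrete, here is a minimal sketch of what I have in mind: post-processing the generated XHTML string and replacing each src="embedded:NAME" with a Base64 data: URI. It uses only the JDK, not any Tika API. The class name EmbeddedImageInliner, the method inlineImages and the images map are hypothetical; the sketch assumes the custom EmbeddedDocumentExtractor has already collected the image bytes in a map keyed by the same name that appears after the embedded: prefix, which may not hold for every input file.

```java
import java.util.Base64;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: rewrites embedded: references to data URIs.
class EmbeddedImageInliner {

    // Matches src="embedded:NAME" and captures NAME.
    private static final Pattern EMBEDDED_SRC =
            Pattern.compile("src=\"embedded:([^\"]+)\"");

    // Assumption: 'images' maps the name used after "embedded:" to the raw image bytes.
    static String inlineImages(String xhtml, Map<String, byte[]> images) {
        Matcher m = EMBEDDED_SRC.matcher(xhtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            byte[] data = images.get(name);
            String replacement;
            if (data == null) {
                // No bytes collected for this name: leave the reference untouched.
                replacement = m.group();
            } else {
                replacement = "src=\"data:" + mimeTypeFor(name) + ";base64,"
                        + Base64.getEncoder().encodeToString(data) + "\"";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Guesses a MIME type from the file extension; falls back to a generic type.
    static String mimeTypeFor(String name) {
        String lower = name.toLowerCase();
        if (lower.endsWith(".png")) return "image/png";
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) return "image/jpeg";
        if (lower.endsWith(".gif")) return "image/gif";
        return "application/octet-stream";
    }
}
```

Replacing the data: URI branch with "src=\"" + name + "\"" would instead give a plain relative reference to the extracted file next to the .xhtml. I would still prefer a way to get Tika to emit the right src directly, if one exists.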
Thank you.