I'm doing some tests with Apache Tika. The goal is to turn complex Word documents (a few pages of text, tables, images, bullet lists with many levels of indentation) into XHTML, preserving as much information and styling as possible.
I found this out-of-the-box example on the official site. It does its job, but with many limitations:
- Bulleted and numbered lists are not output correctly.
  <p class="list_Paragraph">· first element of the list</p>
  is generated instead of <ul><li>first element of the list</li>..., and indentation levels are lost if there are nested lists.
- Text colors, font sizes, alignment and many other styles are not output at all.
- Is it possible to generate a specific output for a specific tag/style (e.g. heading3 turned into <smallHeading> instead of <h3>)?
- Images are not extracted.
Point 4 probably requires an extractor to be implemented (from what I found in other posts), but is it possible to achieve the first three points? Are we talking about a few settings or extending the example parser/handler, or does everything have to be implemented from scratch? Any suggestions?
Thanks a lot.
    public String parseToHTML() throws IOException, SAXException, TikaException {
        ContentHandler handler = new ToXMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }
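For the third point, one direction that might work is to put a small wrapper between the parser and the handler. The parser in the example writes SAX events to a plain org.xml.sax.ContentHandler, so a wrapper can rename elements on the way through. The following is only a minimal sketch using JDK classes, not a confirmed Tika recipe: RenamingHandler, RenamingHandlerDemo and TagRecorder are hypothetical names, the SAX events are fed by hand instead of coming from Tika, and element names are assumed to be unprefixed.

```java
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper (not part of Tika): forwards SAX events to a downstream
// handler, renaming elements according to a lookup table.
class RenamingHandler extends XMLFilterImpl {
    private final Map<String, String> renames;

    RenamingHandler(ContentHandler downstream, Map<String, String> renames) {
        setContentHandler(downstream);
        this.renames = renames;
    }

    // Assumption: elements are unprefixed, so the mapped name is used
    // as both the local name and the qualified name.
    private String rename(String localName, String qName) {
        String key = (localName == null || localName.isEmpty()) ? qName : localName;
        return renames.getOrDefault(key, key);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String name = rename(localName, qName);
        super.startElement(uri, name, name, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String name = rename(localName, qName);
        super.endElement(uri, name, name);
    }
}

class RenamingHandlerDemo {
    // Stand-in for a real serializing handler: records tags and text in a buffer.
    static class TagRecorder extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    // Feeds a hand-written <h3> element through the wrapper and returns what
    // the downstream handler received.
    static String render() {
        TagRecorder recorder = new TagRecorder();
        ContentHandler handler = new RenamingHandler(recorder, Map.of("h3", "smallHeading"));
        String xhtml = "http://www.w3.org/1999/xhtml";
        char[] text = "Section title".toCharArray();
        try {
            handler.startDocument();
            handler.startElement(xhtml, "h3", "h3", new AttributesImpl());
            handler.characters(text, 0, text.length);
            handler.endElement(xhtml, "h3", "h3");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

If this approach carries over to Tika, the wrapper would presumably be passed to parser.parse(...) around the ToXMLContentHandler, keeping a reference to the inner handler so that its toString() can still be read afterwards. I have not verified this against real Word documents.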
` instead of `
- ` too. The difference is that the resulting HTML markup is less "clean" (empty paragraphs, styles, ...) than the one returned by Tika.
– lfurini May 16 '18 at 11:33