
I'm running some tests with Apache Tika. The goal is to convert complex Word documents (a few pages of text, tables, images, bulleted lists with many levels of indentation) into XHTML, preserving as much information and styling as possible.

I found this out-of-the-box example on the official site. It does its job, but with many limitations:

  1. Bulleted and numbered lists are not output correctly. `<p class="list_Paragraph">· first element of the list</p>` is generated instead of `<ul><li>first element of the list</li>...`, and indentation levels are lost when lists are nested.
  2. Text colors, font sizes, alignment and many other styles are not output at all.
  3. Is it possible to generate a specific output for a specific tag/style? (e.g. heading3 turned into `<smallHeading>` instead of `<h3>`)
  4. Images are not extracted.

Point 4 probably requires an extractor to be implemented (from what I found in other posts), but is it possible to achieve the first 3 points above? Is this a matter of a few settings or of extending the example parser/handler, or does everything have to be implemented from scratch? Any suggestions?
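
To make point 1 concrete, here is a rough sketch of the kind of handler extension I have in mind. It is plain JDK SAX, not a Tika API: `ListParagraphFilter` and `convert` are made-up names, and the `list_Paragraph` class is simply what I see in my output. It groups consecutive list paragraphs into one flat `<ul>`, so it does not recover nesting levels and it leaves the literal bullet character in the text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: rewrites runs of <p class="list_Paragraph"> into <ul><li>...</li></ul>.
class ListParagraphFilter extends XMLFilterImpl {
    private static final String LIST_CLASS = "list_Paragraph"; // class seen in the question's output
    private boolean inList = false; // true while a <ul> has been opened and not yet closed
    private int depthInItem = 0;    // > 0 while inside a list paragraph (counts nested elements)

    ListParagraphFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String name = localName.isEmpty() ? qName : localName;
        if (depthInItem > 0) {
            // Inline markup inside a list item is passed through unchanged.
            depthInItem++;
            super.startElement(uri, localName, qName, atts);
        } else if ("p".equals(name) && LIST_CLASS.equals(atts.getValue("class"))) {
            if (!inList) {
                super.startElement(uri, "ul", "ul", new AttributesImpl());
                inList = true;
            }
            super.startElement(uri, "li", "li", new AttributesImpl());
            depthInItem = 1;
        } else {
            closeList(uri); // any other element ends the current run of list paragraphs
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depthInItem > 0) {
            depthInItem--;
            if (depthInItem == 0) {
                super.endElement(uri, "li", "li"); // the list paragraph itself closes as </li>
            } else {
                super.endElement(uri, localName, qName);
            }
        } else {
            closeList(uri); // the parent element is closing, so the list must close first
            super.endElement(uri, localName, qName);
        }
    }

    private void closeList(String uri) throws SAXException {
        if (inList) {
            super.endElement(uri, "ul", "ul");
            inList = false;
        }
    }

    // Demo helper: runs the filter over an XML string using only JDK classes.
    static String convert(String xml) throws Exception {
        StringWriter out = new StringWriter();
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.setResult(new StreamResult(out));

        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);
        XMLReader reader = pf.newSAXParser().getXMLReader();
        reader.setContentHandler(new ListParagraphFilter(serializer));
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

Since the filter is itself a `ContentHandler`, I assume it could be handed to `parser.parse(...)` in place of the plain handler, but I have not verified that against real Tika output, and the lost nesting levels would still need information that is not in the XHTML.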

Thanks a lot.

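import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Parses test.doc with type auto-detection and returns the XHTML that Tika emits.
public String parseToHTML() throws IOException, SAXException, TikaException {
    ContentHandler handler = new ToXMLContentHandler();

    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}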
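
For point 3, the simplest thing I can think of is wrapping the handler so that one element name is swapped for another before it reaches the serializer. The sketch below uses only JDK SAX classes; `TagRenamingHandler` and `renameInXml` are hypothetical names of mine, not part of Tika, and it assumes unprefixed element names such as `h3`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical wrapper: forwards all SAX events, renaming one element on the way.
class TagRenamingHandler extends XMLFilterImpl {
    private final String from;
    private final String to;

    TagRenamingHandler(ContentHandler downstream, String from, String to) {
        setContentHandler(downstream);
        this.from = from;
        this.to = to;
    }

    // Falls back to qName when the event source is not namespace-aware.
    private boolean matches(String localName, String qName) {
        return from.equals(localName.isEmpty() ? qName : localName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (matches(localName, qName)) {
            super.startElement(uri, to, to, atts); // same attributes, new element name
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (matches(localName, qName)) {
            super.endElement(uri, to, to);
        } else {
            super.endElement(uri, localName, qName);
        }
    }

    // Demo helper: applies the renaming to an XML string using only JDK classes.
    static String renameInXml(String xml, String from, String to) throws Exception {
        StringWriter out = new StringWriter();
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.setResult(new StreamResult(out));

        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);
        XMLReader reader = pf.newSAXParser().getXMLReader();
        reader.setContentHandler(new TagRenamingHandler(serializer, from, to));
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(renameInXml("<body><h3>Title</h3><p>text</p></body>", "h3", "smallHeading"));
    }
}
```

With the example above, I would expect to keep a reference to the `ToXMLContentHandler`, pass `new TagRenamingHandler(handler, "h3", "smallHeading")` to `parser.parse(...)`, and still call `toString()` on the inner handler. That part is untested on my side, and it only covers renaming; it would not bring back styles that Tika never emits.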
Sgotenks
  • Apache Tika aims to give clean, semantically meaningful XHTML. If you want all the extra stuff too you'll need to step down into Apache POI – Gagravarr Mar 07 '18 at 09:08
  • Well, not even point 1? Turning bullets into ul/li (possibly with nested levels) seems like a pretty standard requirement. – Sgotenks Mar 07 '18 at 23:47
  • @Gagravarr a quick test with the code from [this answer](https://stackoverflow.com/a/7901139/4453460) shows that Apache POI creates `<p>` instead of `<ul><li>` too. The difference is that the resulting HTML markup is less "clean" (empty paragraphs, styles, ...) than the one returned by Tika. – lfurini May 16 '18 at 11:33
