Someone told me that Tika's XWPFWordExtractorDecorator class is used to convert docx into HTML, but I am not sure how to use this class to get the HTML from a docx file. Suggestions for any other library that does the same job are also appreciated.
1 Answer
4
You shouldn't use it directly. Instead, call Tika in the usual way, and it will call the appropriate code for you.
If you want XHTML from parsing a file, the code looks something like this:
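import java.io.File;
import java.io.InputStream;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

// Either of these will work, the latter is recommended
//InputStream input = new FileInputStream("test.docx");
InputStream input = TikaInputStream.get(new File("test.docx"));

// AutoDetect is normally best, unless you know the best parser for the type
Parser parser = new AutoDetectParser();

// Handler that serializes the SAX events as indented XHTML into a StringWriter
StringWriter sw = new StringWriter();
SAXTransformerFactory factory =
    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.setResult(new StreamResult(sw));

// Call the Tika Parser; the XHTML ends up in the StringWriter
try {
    Metadata metadata = new Metadata();
    parser.parse(input, handler, metadata, new ParseContext());
    String xml = sw.toString();
} finally {
    // Always release the input stream, even if parsing fails
    input.close();
}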
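The handler part of the code above can be tried on its own, without Tika on the classpath. The following is only a minimal sketch: the class name `HandlerSketch` and the method `render` are made up for illustration, and the SAX events are fired by hand to stand in for the events a parser would send. It shows that whatever events reach the `TransformerHandler` are what end up in the `StringWriter`.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper class, not part of Tika.
class HandlerSketch {

    /** Serializes a single hand-made <p> element to a String via a TransformerHandler. */
    static String render(String paragraphText) throws Exception {
        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        // Leave out the <?xml ...?> header so only the element itself is written
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(sw));

        // These calls stand in for the SAX events a parser would emit
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] chars = paragraphText.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();

        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("hello"));
    }
}
```

Because the handler only serializes the events it receives, any style information has to be present in those events (as elements or attributes) to show up in the resulting string.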

Gagravarr
- Thanks. What is XMLResult? And how can I include the styles in the generated HTML? – Imran Qadir Baksh - Baloch Jan 29 '12 at 14:49
- If it is not possible to add styles, can we convert docx into doc? After converting into doc I can use Apache POI. – Imran Qadir Baksh - Baloch Jan 29 '12 at 16:24
- XMLResult is part of the test the code is taken from; the answer is now edited. The XHTML does include the styles. – Gagravarr Jan 29 '12 at 16:50
- The String xml does not include any style. Is there any way to force the parser to include styles in the HTML? – Imran Qadir Baksh - Baloch Jan 29 '12 at 16:54
- It should contain classes on paragraphs with non-standard styles, set to the style name, and things like bold/italic. That's your style information. – Gagravarr Jan 30 '12 at 10:34
- I don't get what you mean. The returned HTML does not include any styles. It only has p, b, i and u tags. – Imran Qadir Baksh - Baloch Jan 30 '12 at 12:11
- If the paragraphs have any non-standard styles applied to them, then that should come through as classes on the paragraphs (or headings etc.). You also get bold, italic, underline etc. styling information inline, which you say yourself that you're seeing. – Gagravarr Jan 30 '12 at 17:03
- I have not seen any style tag or any inline style in my HTML. So all the font sizes, font widths, font families, etc. inside my docx file are ignored and are not present in the generated HTML. – Imran Qadir Baksh - Baloch Jan 30 '12 at 18:04
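One way to check whether a given XHTML string carries the paragraph classes described in the comments above is to scan it for class attributes on p elements. This is only a sketch under stated assumptions: the class `StyleClassSketch`, the method `paragraphClasses`, and the sample markup with its `quote` class are invented for illustration and are not actual Tika output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Tika.
class StyleClassSketch {

    /** Returns the class attribute of every <p> element that has one, in document order. */
    static List<String> paragraphClasses(String xhtml) throws Exception {
        final List<String> classes = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        // The parser is not namespace-aware by default, so qName is the plain tag name
                        if ("p".equals(qName)) {
                            String cls = atts.getValue("class");
                            if (cls != null) {
                                classes.add(cls);
                            }
                        }
                    }
                });
        return classes;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup, for illustration only
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p class=\"quote\">A <b>bold</b> word</p><p>Plain</p>"
                + "</body></html>";
        System.out.println(paragraphClasses(sample));
    }
}
```

An empty result for a real document would suggest that none of its paragraphs came through with a class attribute, in which case only the inline b, i and u tags carry styling.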