I am trying to read a Word (.docx) file that looks like this: [Docx file screenshot]
First, I tried using Apache POI's XWPFRun:
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public String[] readStringFromFile(String absolutePath) throws Exception {
    // try-with-resources closes both the stream and the document, even on failure
    try (FileInputStream fileInputStream = new FileInputStream(absolutePath);
         XWPFDocument document = new XWPFDocument(fileInputStream)) {
        List<XWPFParagraph> paragraphs = document.getParagraphs();
        List<String> strings = new ArrayList<>();
        for (XWPFParagraph paragraph : paragraphs) {
            strings.add(paragraph.getText());
            // Print the direct formatting of every run in the paragraph
            for (XWPFRun run : paragraph.getRuns()) {
                System.out.println("Run: " + run.text());
                System.out.println("Run infos:");
                System.out.println("Bold: " + run.isBold() + " Italic: " + run.isItalic() + " Underlined: " + run.getUnderline());
                System.out.println("Superscript/Subscript: " + run.getVerticalAlignment() + "\n");
            }
        }
        return strings.toArray(new String[0]);
    }
}
Output:
Run: This one is a test Docx file.
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: Math:
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: This is bold
Run infos:
Bold: true Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: This one’s italic
Run infos:
Bold: false Italic: true Underlined: NONE
Superscript/Subscript: baseline
Run: Underlined
Run infos:
Bold: false Italic: false Underlined: SINGLE
Superscript/Subscript: baseline
Run: Bold Italic and underlined
Run infos:
Bold: true Italic: true Underlined: SINGLE
Superscript/Subscript: baseline
Run: Bold and italic
Run infos:
Bold: true Italic: true Underlined: NONE
Superscript/Subscript: baseline
Run: In Same Line:
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: Bold
Run infos:
Bold: true Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run:
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: Italic
Run infos:
Bold: false Italic: true Underlined: NONE
Superscript/Subscript: baseline
Run:
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: Underlined
Run infos:
Bold: false Italic: false Underlined: SINGLE
Superscript/Subscript: baseline
Run:
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: Bold and Italic
Run infos:
Bold: true Italic: true Underlined: NONE
Superscript/Subscript: baseline
Run: W
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: e have some
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: superscript
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: superscript
Run: and
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline
Run: subscript
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: subscript
This gives me the text formatting, but no math information at all.
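One avenue that may be worth checking: a .docx file is a ZIP archive, and its main part, word/document.xml, stores equations as Office Math (OMML) elements such as m:oMath, separately from the w:r text runs. The following is only a minimal sketch using the JDK alone, under the assumption that the equation in my file is stored this way. The class name DocxMathSketch and the method mathTexts are hypothetical helpers, not part of POI or Tika, and readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Apache POI or Tika. Requires Java 9+ (readAllBytes).
final class DocxMathSketch {
    // Namespace of Office Math (OMML) elements such as m:oMath, m:r and m:t.
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    /** Returns the plain text (all m:t content) of each m:oMath in word/document.xml. */
    static List<String> mathTexts(InputStream docx) throws Exception {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the ZIP stream.
                byte[] xml = zip.readAllBytes();
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true); // needed for the *NS lookups below
                Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

                NodeList equations = doc.getElementsByTagNameNS(M_NS, "oMath");
                for (int i = 0; i < equations.getLength(); i++) {
                    NodeList texts = ((Element) equations.item(i)).getElementsByTagNameNS(M_NS, "t");
                    StringBuilder sb = new StringBuilder();
                    for (int j = 0; j < texts.getLength(); j++) {
                        sb.append(texts.item(j).getTextContent());
                    }
                    result.add(sb.toString());
                }
                break; // the main document part has been handled
            }
        }
        return result;
    }
}
```

Calling DocxMathSketch.mathTexts(new FileInputStream(absolutePath)) would return one string per equation. This flattens each equation to its text only; structure such as fractions (m:f) or superscripts (m:sSup) would need a walk over the child elements instead.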
Then I tried using Apache Tika and extracting the information from the XHTML it returns:
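import java.io.*;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.*;

private void getHtmlUsingTika(String absolutePath) throws IOException, TikaException, SAXException {
    ContentHandler handler = new ToXMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(absolutePath)) {
        parser.parse(stream, handler, metadata);
    }
    System.out.println(handler.toString());
}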
Output:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="cp:revision" content="25" />
<meta name="extended-properties:AppVersion" content="15.0000" />
<meta name="meta:paragraph-count" content="1" />
<meta name="meta:word-count" content="33" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="meta:last-author" content="Microsoft account" />
<meta name="extended-properties:Company" content="" />
<meta name="xmpTPg:NPages" content="1" />
<meta name="dcterms:created" content="2022-04-25T09:09:00Z" />
<meta name="meta:line-count" content="1" />
<meta name="dcterms:modified" content="2022-08-28T07:08:00Z" />
<meta name="meta:character-count" content="189" />
<meta name="extended-properties:Template" content="Normal.dotm" />
<meta name="meta:character-count-with-spaces" content="221" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="extended-properties:DocSecurityString" content="None" />
<meta name="extended-properties:TotalTime" content="35" />
<meta name="meta:page-count" content="1" />
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
<meta name="dc:publisher" content="" />
<title></title>
</head>
<body><p>This one is a test Docx file.</p>
<p>Math: </p>
<p><b>This is bold</b></p>
<p><i>This one’s italic</i></p>
<p><u>Underlined</u></p>
<p><b><i><u>Bold Italic and underlined</u></i></b></p>
<p><b><i>Bold and italic</i></b></p>
<p>In Same Line: <b>Bold</b> <i>Italic</i> <u>Underlined</u> <b><i>Bold and Italic</i></b></p>
<p>We have somesuperscript andsubscript</p>
<p><a name="_GoBack" /></p>
</body></html>
Again, I am not getting any math information. Furthermore, this approach also loses the superscript and subscript information.
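For comparison, the superscript/subscript flag that is missing from the Tika output is stored in each run's properties (w:rPr) as w:vertAlign, next to w:b, w:i and w:u. Below is a rough sketch, again using only the JDK, that reads word/document.xml directly and wraps each run in b/i/u/sup/sub tags. The class DocxRunTagsSketch and all of its methods are hypothetical names chosen for illustration. It only looks at direct run formatting (nothing inherited from styles), only at runs that are direct children of a paragraph, and it needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Apache POI or Tika. Requires Java 9+ (readAllBytes).
final class DocxRunTagsSketch {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Renders each w:p of word/document.xml as <p>...</p> with b/i/u/sup/sub tags. */
    static String toHtml(InputStream docx) throws Exception {
        StringBuilder html = new StringBuilder();
        NodeList paragraphs = readDocumentXml(docx).getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            html.append("<p>");
            // Direct children only: runs nested in hyperlinks or fields are skipped.
            for (Node n = paragraphs.item(i).getFirstChild(); n != null; n = n.getNextSibling()) {
                if (isW(n, "r")) {
                    html.append(renderRun((Element) n));
                }
            }
            html.append("</p>\n");
        }
        return html.toString();
    }

    static String renderRun(Element run) {
        StringBuilder text = new StringBuilder();
        Element rPr = null;
        for (Node n = run.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (isW(n, "t")) text.append(escape(n.getTextContent()));
            if (isW(n, "rPr")) rPr = (Element) n;
        }
        if (rPr == null) return text.toString();
        String out = text.toString();
        String vertAlign = val(rPr, "vertAlign");
        if ("superscript".equals(vertAlign)) out = "<sup>" + out + "</sup>";
        if ("subscript".equals(vertAlign)) out = "<sub>" + out + "</sub>";
        String u = val(rPr, "u");
        if (u != null && !"none".equals(u)) out = "<u>" + out + "</u>";
        if (isOn(val(rPr, "i"))) out = "<i>" + out + "</i>";
        if (isOn(val(rPr, "b"))) out = "<b>" + out + "</b>";
        return out;
    }

    /** w:val of the named child of rPr; "" if the attribute is absent, null if the child is. */
    static String val(Element rPr, String localName) {
        for (Node n = rPr.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (isW(n, localName)) return ((Element) n).getAttributeNS(W_NS, "val");
        }
        return null;
    }

    /** A toggle such as <w:b/> is on unless it is missing or explicitly switched off. */
    static boolean isOn(String val) {
        return val != null && !val.equals("false") && !val.equals("0") && !val.equals("off");
    }

    static boolean isW(Node n, String localName) {
        return n.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(n.getNamespaceURI()) && localName.equals(n.getLocalName());
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static Document readDocumentXml(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                    factory.setNamespaceAware(true);
                    // Copy the entry first so the XML parser cannot close the ZIP stream.
                    return factory.newDocumentBuilder()
                            .parse(new ByteArrayInputStream(zip.readAllBytes()));
                }
            }
        }
        throw new IllegalArgumentException("word/document.xml not found");
    }
}
```

The tags are applied innermost first (sup/sub, then u, i, b), which matches the nesting order Tika uses for b, i and u in the output above. Tabs, line breaks and other non-text run content are ignored in this sketch.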
I am a beginner when it comes to handling Word documents and have been stuck on this issue for a while now. Is there a way to get both the math and the other style information at the same time in Java? Also, is it possible using JavaScript?