How to read a Word file and get both formatting and math expression information in Java?

Question

I am trying to read a Word file which looks like this: [screenshot of the Docx file]

First I tried using XWPFRun:

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public String[] readStringFromFile(String absolutePath) throws Exception {
    List<String> strings = new ArrayList<>();

    // try-with-resources closes the stream and the document even if reading fails
    try (FileInputStream fileInputStream = new FileInputStream(absolutePath);
         XWPFDocument document = new XWPFDocument(fileInputStream)) {
        for (XWPFParagraph paragraph : document.getParagraphs()) {
            strings.add(paragraph.getText());
            // Print the text and the direct formatting of every run in the paragraph
            for (XWPFRun run : paragraph.getRuns()) {
                System.out.println("Run: " + run.text());
                System.out.println("Run infos:");
                System.out.println("Bold: " + run.isBold() + " Italic: " + run.isItalic() + " Underlined: " + run.getUnderline());
                System.out.println("Superscript/Subscript: " + run.getVerticalAlignment() + "\n");
            }
        }
    }
    return strings.toArray(new String[0]);
}

Output:

Run: This one is a test Docx file.
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: Math: 
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: This is bold
Run infos:
Bold: true Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: This one’s italic
Run infos:
Bold: false Italic: true Underlined: NONE
Superscript/Subscript: baseline

Run: Underlined
Run infos:
Bold: false Italic: false Underlined: SINGLE
Superscript/Subscript: baseline

Run: Bold Italic and underlined
Run infos:
Bold: true Italic: true Underlined: SINGLE
Superscript/Subscript: baseline

Run: Bold and italic
Run infos:
Bold: true Italic: true Underlined: NONE
Superscript/Subscript: baseline

Run: In Same Line: 
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: Bold
Run infos:
Bold: true Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run:  
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: Italic
Run infos:
Bold: false Italic: true Underlined: NONE
Superscript/Subscript: baseline

Run:  
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: Underlined
Run infos:
Bold: false Italic: false Underlined: SINGLE
Superscript/Subscript: baseline

Run:  
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: Bold and Italic
Run infos:
Bold: true Italic: true Underlined: NONE
Superscript/Subscript: baseline

Run: W
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: e have some
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: superscript
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: superscript

Run:  and
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: baseline

Run: subscript
Run infos:
Bold: false Italic: false Underlined: NONE
Superscript/Subscript: subscript

This is not giving me any math information.

Then I tried using Apache Tika and getting the information from the returned HTML:

private void getHtmlUsingTika(String absolutePath) throws IOException, TikaException, SAXException {
    ContentHandler handler = new ToXMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();

    // try-with-resources closes the stream once parsing finishes or fails
    try (InputStream stream = new FileInputStream(new File(absolutePath))) {
        parser.parse(stream, handler, metadata);
    }
    System.out.println(handler.toString());
}

Output:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="cp:revision" content="25" />
<meta name="extended-properties:AppVersion" content="15.0000" />
<meta name="meta:paragraph-count" content="1" />
<meta name="meta:word-count" content="33" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="meta:last-author" content="Microsoft account" />
<meta name="extended-properties:Company" content="" />
<meta name="xmpTPg:NPages" content="1" />
<meta name="dcterms:created" content="2022-04-25T09:09:00Z" />
<meta name="meta:line-count" content="1" />
<meta name="dcterms:modified" content="2022-08-28T07:08:00Z" />
<meta name="meta:character-count" content="189" />
<meta name="extended-properties:Template" content="Normal.dotm" />
<meta name="meta:character-count-with-spaces" content="221" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="extended-properties:DocSecurityString" content="None" />
<meta name="extended-properties:TotalTime" content="35" />
<meta name="meta:page-count" content="1" />
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
<meta name="dc:publisher" content="" />
<title></title>
</head>
<body><p>This one is a test Docx file.</p>
<p>Math: </p>
<p><b>This is bold</b></p>
<p><i>This one’s italic</i></p>
<p><u>Underlined</u></p>
<p><b><i><u>Bold Italic and underlined</u></i></b></p>
<p><b><i>Bold and italic</i></b></p>
<p>In Same Line: <b>Bold</b> <i>Italic</i> <u>Underlined</u> <b><i>Bold and Italic</i></b></p>
<p>We have somesuperscript andsubscript</p>
<p><a name="_GoBack" /></p>
</body></html>

Again, I am not getting any math information. Furthermore, this approach does not give me any superscript or subscript information.

I am a beginner when it comes to handling Word documents and have been stuck on this issue for a while now. Is there a way to get both the math and the other style information at the same time in Java? Also, is it possible using JavaScript?

For how to read equations to HTML see https://stackoverflow.com/questions/59414033/reading-equations-from-word-docx-to-html-together-with-their-text-context-us/59502090#59502090. That does not answer the question about the formatting. But fully converting a Word document to HTML would amount to programming a whole library. That is much too broad for an answer here. — Axel Richter, Aug 28 '22 at 08:20
*"This is not giving me any math information"* What did you expect exactly? How could you convert an arbitrary math expression to a simple text string? — Olivier, Aug 28 '22 at 08:47
@Olivier I am expecting something like a string/boolean by which I can determine if there is any math present or not. For example in my case: $1Σ10 x$ — Adnan Bin Zahir, Aug 28 '22 at 09:01
It looks as if the libraries you are using discard equation XML altogether. Not a familiar area but you might have to use something like POI-OpenXML4J to get the Math OOXML and interpret any formatting elements yourself. (Don't know whether many people actually do much formatting *inside* an equation as they may rely on the equation's built-in features for most formatting. Either you can rely on that or you can't!). — jonsson, Aug 29 '22 at 14:21
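
One way to sketch jonsson's suggestion of reading the Math OOXML yourself, without POI or Tika: a .docx is a ZIP archive whose main part is word/document.xml, so the JDK's ZIP and DOM classes are enough to walk it and report text runs (w:r) and equations (m:oMath) in document order. This is only a rough sketch under several assumptions. The class and method names (DocxMathScan, scanDocx, scanDocumentXml) are made up for illustration, the namespace URIs are the transitional OOXML ones, only direct run formatting is read (formatting inherited from styles is ignored), and an equation is reported as its flattened characters rather than its structure.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI or Tika: scans word/document.xml directly.
class DocxMathScan {
    // Transitional OOXML namespaces (assumption: strict documents use different URIs).
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    /** Opens the .docx (a ZIP archive) and scans its main part, word/document.xml. */
    public static List<String> scanDocx(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path);
             InputStream in = zip.getInputStream(zip.getEntry("word/document.xml"))) {
            return scanDocumentXml(in);
        }
    }

    /** Returns one line per text run or equation, in document order. */
    public static List<String> scanDocumentXml(InputStream xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getNamespaceURI()/getLocalName()
            Element root = factory.newDocumentBuilder().parse(xml).getDocumentElement();
            List<String> out = new ArrayList<>();
            walk(root, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    private static void walk(Node node, List<String> out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (!(c instanceof Element)) continue;
            Element e = (Element) c;
            if (M.equals(e.getNamespaceURI()) && "oMath".equals(e.getLocalName())) {
                // One equation; getTextContent() keeps its characters but drops the structure.
                out.add("MATH text=\"" + e.getTextContent() + "\"");
            } else if (W.equals(e.getNamespaceURI()) && "r".equals(e.getLocalName())) {
                out.add(describeRun(e));
            } else {
                walk(e, out); // body, paragraphs, m:oMathPara, hyperlinks, ...
            }
        }
    }

    private static String describeRun(Element run) {
        StringBuilder text = new StringBuilder();
        NodeList ts = run.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < ts.getLength(); i++) text.append(ts.item(i).getTextContent());
        Element rPr = child(run, "rPr"); // direct run properties, may be absent
        Element u = child(rPr, "u");
        Element va = child(rPr, "vertAlign");
        return "RUN text=\"" + text + "\" bold=" + isOn(rPr, "b") + " italic=" + isOn(rPr, "i")
                + " underline=" + (u == null ? "none" : u.getAttributeNS(W, "val"))
                + " vertAlign=" + (va == null ? "baseline" : va.getAttributeNS(W, "val"));
    }

    // Toggle properties such as w:b are on when present, unless w:val switches them off.
    private static boolean isOn(Element rPr, String name) {
        Element e = child(rPr, name);
        if (e == null) return false;
        String val = e.getAttributeNS(W, "val");
        return !(val.equals("0") || val.equals("false") || val.equals("off"));
    }

    // First direct child of parent in the w: namespace with the given local name, or null.
    private static Element child(Element parent, String localName) {
        if (parent == null) return null;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element && W.equals(c.getNamespaceURI())
                    && localName.equals(c.getLocalName())) return (Element) c;
        }
        return null;
    }
}
```

Calling DocxMathScan.scanDocx(absolutePath) and printing each returned line should show a MATH entry after the "Math: " run, provided the equation in the file is stored as Office Math. Turning that equation into something displayable is what the answer linked by Axel Richter covers.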
