If the needed format is HTML, then Word text content together with Office MathML equations can be read the following way.
In Reading equations & formula from Word (Docx) to html and save database using java I have provided an example which gets all Office MathML equations out of a Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of their text context.
If one wants to get those OMath elements in context with the other elements in the paragraphs, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The following example uses the XmlCursor to get text runs together with OMath elements from the paragraph, in document order.
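To illustrate why document order matters, here is a minimal stand-alone sketch that uses only the JDK's DOM API instead of XmlCursor and apache poi. The class name InOrderParagraphWalk, the sample paragraph XML and the "[formula]" placeholder are all hypothetical. It matches elements by namespace URI rather than by prefix, since a prefix is only a document-local alias. It only looks at direct children of the paragraph, so runs nested in other containers (or oMath wrapped in oMathPara) would need a deeper walk.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class InOrderParagraphWalk {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Walks the direct children of a w:p element in document order and
    // interleaves run text with a placeholder for each oMath element.
    // Checked exceptions are wrapped to keep the sketch short.
    static String walk(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getNamespaceURI/getLocalName
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(paragraphXml)));
            StringBuilder out = new StringBuilder();
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                if (W_NS.equals(n.getNamespaceURI()) && "r".equals(n.getLocalName())) {
                    // text run: append its text
                    out.append(n.getTextContent());
                } else if (M_NS.equals(n.getNamespaceURI()) && "oMath".equals(n.getLocalName())) {
                    // formula: the real example would append MathML here
                    out.append("[formula]");
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\" xmlns:m=\"" + M_NS + "\">"
                + "<w:r><w:t>Area: </w:t></w:r>"
                + "<m:oMath><m:r><m:t>A=ab</m:t></m:r></m:oMath>"
                + "<w:r><w:t> for a rectangle.</w:t></w:r>"
                + "</w:p>";
        System.out.println(walk(xml)); // prints: Area: [formula] for a rectangle.
    }
}
```

The formula placeholder ends up between the two text runs, which is exactly what gets lost when the OMath elements are collected separately from the text.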
The transformation from Office MathML into MathML is done using the same XSLT approach as in Reading equations & formula from Word (Docx) to html and save database using java. That answer also describes where the OMML2MML.XSL comes from.
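The XSLT step itself only needs JDK classes. The following is a hypothetical stand-alone sketch: its tiny inline stylesheet is a stand-in for OMML2MML.XSL (not the real one) and merely maps a made-up sum element to an mrow. The class name CompiledStylesheetSketch is an assumption too. It compiles the stylesheet once into a javax.xml.transform.Templates object and creates a cheap Transformer per call, which may be worth considering when a document contains many formulas.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CompiledStylesheetSketch {
    // Stand-in stylesheet for illustration only, NOT the real OMML2MML.XSL.
    static final String XSL =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:template match=\"/sum\">"
            + "<mrow><mn><xsl:value-of select=\"a\"/></mn><mo>+</mo>"
            + "<mn><xsl:value-of select=\"b\"/></mn></mrow>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Compiled once; Templates objects can be reused for many transformations.
    static final Templates TEMPLATES = compile();

    static Templates compile() {
        try {
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(XSL)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Transforms one XML fragment, omitting the XML declaration as in the main example.
    static String transform(String xml) {
        try {
            Transformer transformer = TEMPLATES.newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<sum><a>1</a><b>2</b></sum>"));
    }
}
```

With the real stylesheet, the source would be a DOMSource over the oMath node instead of a StreamSource over a string; the Templates idea stays the same.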
The file Formula.docx looks like:

Code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.apache.xmlbeans.XmlCursor;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;
/*
needs the full ooxml-schemas-1.4.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadTextWithFormulasAsHTML {

    static File stylesheet = new File("OMML2MML.XSL");
    static TransformerFactory tFactory = TransformerFactory.newInstance();
    static StreamSource stylesource = new StreamSource(stylesheet);

    //method for getting MathML from oMath
    static String getMathML(CTOMath ctomath) throws Exception {
        Transformer transformer = tFactory.newTransformer(stylesource);

        Node node = ctomath.getDomNode();

        DOMSource source = new DOMSource(node);
        StringWriter stringwriter = new StringWriter();
        StreamResult result = new StreamResult(stringwriter);
        transformer.setOutputProperty("omit-xml-declaration", "yes");
        transformer.transform(source, result);

        String mathML = stringwriter.toString();
        stringwriter.close();

        //The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
        //We don't need these since we want to use the MathML in HTML, not in XML.
        //So ideally we should change the OMML2MML.XSL to not do so.
        //But to keep this example as simple as possible, we use replaceAll to get rid of the XML specifics.
        mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
        mathML = mathML.replaceAll("xmlns:mml", "xmlns");
        mathML = mathML.replaceAll("mml:", "");

        return mathML;
    }

    //method for getting HTML including MathML from XWPFParagraph
    static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {

        StringBuilder textWithFormulas = new StringBuilder();

        //using a cursor to go through the paragraph from top to bottom
        XmlCursor xmlcursor = paragraph.getCTP().newCursor();
        while (xmlcursor.hasNextToken()) {
            XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
            if (tokentype.isStart()) {
                if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
                    //elements w:r are text runs within the paragraph
                    //simply append the text data
                    textWithFormulas.append(xmlcursor.getTextValue());
                } else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
                    //we have oMath
                    //append the oMath as MathML
                    textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
                }
            } else if (tokentype.isEnd()) {
                //we have to check whether we are at the end of the paragraph:
                //remember the position, look at the parent, then restore the position
                xmlcursor.push();
                xmlcursor.toParent();
                if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
                    break;
                }
                xmlcursor.pop();
            }
        }

        return textWithFormulas.toString();
    }

    public static void main(String[] args) throws Exception {

        XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

        //using a StringBuilder for appending all the content as HTML
        StringBuilder allHTML = new StringBuilder();

        //loop over all IBodyElements - should be self-explanatory
        for (IBodyElement ibodyelement : document.getBodyElements()) {
            if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
                XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
                allHTML.append("<p>");
                allHTML.append(getTextAndFormulas(paragraph));
                allHTML.append("</p>");
            } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
                XWPFTable table = (XWPFTable)ibodyelement;
                allHTML.append("<table border=1>");
                for (XWPFTableRow row : table.getRows()) {
                    allHTML.append("<tr>");
                    for (XWPFTableCell cell : row.getTableCells()) {
                        allHTML.append("<td>");
                        for (XWPFParagraph paragraph : cell.getParagraphs()) {
                            allHTML.append("<p>");
                            allHTML.append(getTextAndFormulas(paragraph));
                            allHTML.append("</p>");
                        }
                        allHTML.append("</td>");
                    }
                    allHTML.append("</tr>");
                }
                allHTML.append("</table>");
            }
        }

        document.close();

        //creating a sample HTML file
        String encoding = "UTF-8";
        FileOutputStream fos = new FileOutputStream("result.html");
        OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
        writer.write("<!DOCTYPE html>\n");
        writer.write("<html lang=\"en\">");
        writer.write("<head>");
        writer.write("<meta charset=\"utf-8\"/>");

        //using MathJax to help all browsers interpret MathML
        writer.write("<script type=\"text/javascript\"");
        writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
        writer.write(">");
        writer.write("</script>");

        writer.write("</head>");
        writer.write("<body>");

        writer.write(allHTML.toString());

        writer.write("</body>");
        writer.write("</html>");
        writer.close();

        Desktop.getDesktop().browse(new File("result.html").toURI());
    }
}
Result:

Just tested this code using apache poi 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for apache poi 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 to see which ooxml libraries are needed for which apache poi version.
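One thing to be aware of: the example appends the run text from xmlcursor.getTextValue() to the HTML unchanged, so a Word text containing characters such as < or & could end up as broken markup. A minimal sketch of an escaping helper, using only the JDK, might look like the following. The class and method names (HtmlEscapeSketch, escapeHtml) are hypothetical and not part of apache poi.

```java
final class HtmlEscapeSketch {

    // Replaces the characters that have a special meaning in HTML text and attributes.
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("if a < b & b > c")); // prints: if a &lt; b &amp; b &gt; c
    }
}
```

In getTextAndFormulas such a helper would wrap only the text of the w:r runs; the MathML returned by getMathML is already markup and must not be escaped.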