Unpack DOCX with Java and parse XML with XPath

Asked Nov 25 '17 at 18:45

Active Nov 25 '17 at 20:01

Viewed 439 times

I'm trying to parse document.xml inside a DOCX archive, but I'm stuck because I can't retrieve a NodeList with XPath.

File docxFile = new File("input.docx");
URI docxUri = URI.create("jar:" + docxFile.toURI());
Map<String, String> zipProperties = new HashMap<>();
zipProperties.put("encoding", "UTF-8");
FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties);
Path documentXmlPath = zipFS.getPath("/word/document.xml");

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(Files.newInputStream(documentXmlPath));
NodeList paragraphs = doc.getElementsByTagName("w:p");
System.out.println(paragraphs.getLength()); // gives real number of nodes

XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "//w:p";
NodeList nodes = (NodeList) xpath.compile(expression).evaluate(doc, XPathConstants.NODESET);
System.out.println(nodes.getLength()); // gives 0;

The DOM getElementsByTagName() method works fine, but XPath does not. What am I doing wrong?

edited Nov 25 '17 at 20:01

Vitaliy

Most likely it requires a namespace definition as is usual – Sami Kuhmonen Nov 25 '17 at 18:46
Many thanks. It was indeed related to namespaces. I found the solution here: https://stackoverflow.com/a/6392700/6711224 – Vitaliy Nov 25 '17 at 20:00
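
For readers landing here, the namespace fix mentioned in the comments could look roughly like the sketch below. It assumes the document uses the transitional WordprocessingML namespace URI (files saved in Strict mode use a different one), and the class name WordXPathSketch, the helper countParagraphs, and the inline sample XML are hypothetical names introduced only for illustration. The idea is that a namespace-aware XPath needs a NamespaceContext that maps the prefix w to the namespace URI; the prefix declared inside document.xml is not picked up automatically.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class WordXPathSketch {
    // Assumption: transitional WordprocessingML namespace (Strict files differ).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Maps the prefix "w" used in the XPath expression to the namespace URI.
    static final class WordNamespaceContext implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return W_NS.equals(namespaceURI) ? "w" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return W_NS.equals(namespaceURI)
                    ? Collections.singletonList("w").iterator()
                    : Collections.<String>emptyIterator();
        }
    }

    // Hypothetical helper: counts w:p elements in the given XML text.
    static int countParagraphs(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Without this call the prefix "w" in the expression cannot be resolved.
        xpath.setNamespaceContext(new WordNamespaceContext());
        NodeList nodes = (NodeList) xpath.evaluate("//w:p", doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the contents of word/document.xml.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\">"
                + "<w:body><w:p/><w:p/></w:body></w:document>";
        System.out.println(countParagraphs(xml)); // prints 2
    }
}
```

In the question's code, the same setNamespaceContext call would go right after the XPath object is created, before compiling "//w:p". XPath matches on the namespace URI rather than the prefix text, so the prefix in the expression only has to agree with the NamespaceContext, not with the prefix used in the file.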

0 Answers