
I am trying to parse the XML output of Stanford NLP in Java:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
// Wrap the tagger output in a single root element so it parses as one document.
InputSource is = new InputSource(new StringReader("<a>" + tagged + "</a>"));
Document doc = builder.parse(is);
doc.getDocumentElement().normalize();
NodeList nl = doc.getElementsByTagName("sentence");

The problem is that the XML output of Stanford NLP contains unescaped " characters inside attribute values, like

<word wid="9" pos="``" lemma=""">"</word>

Then, I get the error:

[Fatal Error] :11:34: Element type "word" must be followed by either attribute specifications, ">" or "/>".
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 11; columnNumber: 34; Element type "word" must be followed by either attribute specifications, ">" or "/>".
    at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:261)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at y.main(y.java:46)

I thought of replacing or escaping """ and >"< in the raw string before parsing, but that is a non-standard approach and may break the rest of the XML.
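
A minimal sketch of that replace/escape idea, assuming the only defect is a raw " inside an attribute value, and assuming an attribute value always ends at a quote that is followed by another attribute or by the end of the tag. The class and method names (`QuoteFixer`, `escapeAttributeQuotes`, `parseTagged`) are made up for illustration, and it needs Java 9+ for the function-based `Matcher.replaceAll`. It does not handle a raw `&` or `<`, which may or may not occur in this output.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class QuoteFixer {
    // Assumption: a value ends at a quote followed by either another
    // attribute (whitespace, name, =") or the end of the tag (> or />).
    private static final Pattern ATTR =
        Pattern.compile("([\\w:.-]+)=\"(.*?)\"(?=\\s+[\\w:.-]+=\"|\\s*/?>)");

    // Rewrites every attribute so that quotes inside its value become &quot;.
    // Well-formed attributes are emitted unchanged.
    public static String escapeAttributeQuotes(String xml) {
        return ATTR.matcher(xml).replaceAll(m -> {
            String value = m.group(2).replace("\"", "&quot;");
            // quoteReplacement keeps '$' and '\' in tokens from being
            // interpreted as group references in the replacement string.
            return Matcher.quoteReplacement(m.group(1) + "=\"" + value + "\"");
        });
    }

    // Repairs the raw tagger output, wraps it in a root element, and parses it.
    public static Document parseTagged(String tagged) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String wrapped = "<a>" + escapeAttributeQuotes(tagged) + "</a>";
        return builder.parse(new InputSource(new StringReader(wrapped)));
    }
}
```

The raw " in the element text (`>"<`) is left alone, since a double quote is legal in XML character data; only the attribute value needs escaping.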

Googlebot
  • That is not valid XML; the double quotes in the `lemma` attribute need to be escaped correctly. See the duplicate. – Jim Garrison Jun 22 '21 at 17:59
  • @JimGarrison I have no control over the `stanford NLP` output. How can I escape/replace `"`? This is indeed what I clarified in the question. – Googlebot Jun 22 '21 at 18:04
  • This may be a bug in the Stanford NLP XML output formatter. It appears to be using CSV-style doubling for "escaping" quotes inside quotes, which is not valid in XML. I suggest you ask on one of the [Stanford mailing lists](https://nlp.stanford.edu/software/email.html), most likely `java-nlp-user@lists.stanford.edu`. You have to join before you can post; the link to join is on the page I linked. – Jim Garrison Jun 22 '21 at 18:11
  • @JimGarrison I don't think it's a bug, as it has been the output format for years. The aim is not to have a standard XML format. I had to wrap the children in `<a>` to be able to parse it. I have to find a way to modify the current format. They do not alter the words. – Googlebot Jun 22 '21 at 18:22
  • This is indeed a bug in the XML output of the standalone POS Tagger – as far as I can see, neither the Stanford Parser nor Stanford CoreNLP has this problem. I have just fixed it in commit 3a9e7df56, so the next release should produce well-formed XML. – Christopher Manning Jul 05 '21 at 22:11
