Parsing XML in Groovy with namespace and entities

Question

Parsing XML in Groovy should be a piece of cake, but I always run into problems.

I would like to parse a string like this:

<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>

When I do it the standard way, new XmlSlurper().parseText(body), the parser complains about the &nbsp; entity. My secret weapon in cases like this is to use TagSoup:

def parser = new org.ccil.cowan.tagsoup.Parser()
def page = new XmlSlurper(parser).parseText(body)

But now the <ac:special> tag is closed immediately by the parser, so the text "special" is no longer inside this tag in the resulting DOM. This happens even when I disable the namespaces feature:

def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)

Another approach was to use the standard parser and to add a doctype like this one:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This seems to work for most of my files, but it takes ages for the parser to fetch the DTD and process it.

Any good ideas on how to solve this?

PS: here is some sample code to play around with:

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='0.9.7')
def processNode(node) {
    def out = new StringBuilder("")
    node.children.each {
        if (it instanceof String) {
            out << it
        } else {
            out << "<${it.name()}>${processNode(it)}</${it.name()}>"
        }
    }
    return out.toString()
}

def body = """<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>"""

def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)
def out = new StringBuilder("")
page.childNodes().each {
    out << processNode(it)
}
println out.toString()
""

score 2 · Accepted Answer · answered Aug 18 '13 at 10:09

You will have to decide whether you want parsing to conform to standards, going the DTD path, or accept just anything with a permissive parser.

TagSoup, in my experience, is fine for the latter and rarely creates any problems, so I was surprised to read your remark about its handling of "special". A quick test showed that I could not reproduce it. When running this command

  java net.sf.saxon.Query -x:org.ccil.cowan.tagsoup.Parser -s:- -qs:. !encoding=ASCII !indent=yes

on your sample, I received this result

<?xml version="1.0" encoding="ASCII"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml">
   <body>
      <p>
    This&#xa0;is a <span>test</span> with <b>some</b> formattings.<br clear="none"/>
    And this has a <ac:special xmlns:ac="urn:x-prefix:ac">special</ac:special> formatting.
  </p>

   </body>
</html>

from both TagSoup 1.2 and 1.2.1. So for me that behaved as expected, the text "special" appearing inside of the "ac:special" tag.

As for the DTD variant, you could look into resolving the DTD through a caching proxy, refer to a local copy, or even reduce the DTD to the bare minimum that you need. The following should be sufficient to get you past the &nbsp; entity:

<!DOCTYPE DOC[<!ENTITY nbsp "&#160;">]>
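A minimal Groovy sketch of this reduced-DTD variant, assuming Groovy 3 or later (where XmlSlurper lives in groovy.xml and offers a three-argument constructor). The DOCTYPE name html and the decision to switch off namespace awareness, so that the undeclared ac: prefix is not rejected, are assumptions of this sketch and have not been tried against your real files:

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; older versions ship groovy.util.XmlSlurper

def body = '''<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>'''

// Internal DTD subset that declares only the entity we need; nothing is fetched.
def doctype = '<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>'

// Arguments: validating = false, namespaceAware = false, allowDocTypeDeclaration = true.
// Namespace awareness is off because the "ac" prefix is never declared in the input.
def slurper = new XmlSlurper(false, false, true)
def page = slurper.parseText(doctype + body)

// Look the element up by its raw name; endsWith tolerates the retained "ac:" prefix.
def special = page.depthFirst().find { it.name().endsWith('special') }
println special.text()
```

Because the entity is declared inline, the parser has no external DTD to download, which avoids the delay described in the question.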

Great! It was the version of the tagsoup parser I used (0.9.x)... 1.2.1 works fine for me. Thanx! — rdmueller, Aug 18 '13 at 11:18
