Parsing XML in Groovy should be a piece of cake, but I always run into problems.
I would like to parse a string like this:
<html>
<p>
This is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>
When I do it the standard way new XmlSlurper().parseText(body)
, the parser complains about the  
entity. My secret weapon in cases like this is to use tagsoup:
def parser = new org.ccil.cowan.tagsoup.Parser()
def page = new XmlSlurper(parser).parseText(body)
But now the <ac:sepcial>
tag will be closed immediatly by the parser - the special
text will not be inside this tag in the resulting dom. Even when I disable the namespace-feature:
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)
Another approach was to use the standard parser and to add a doctype like this one:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This seems to work for most of my files, but it takes ages for the parser to fetch the dtd and process it.
Any good idea how to solve this?
PS: here is some sample code to play around with:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='0.9.7')
def processNode(node) {
def out = new StringBuilder("")
node.children.each {
if (it instanceof String) {
out << it
} else {
out << "<${it.name()}>${processNode(it)}</${it.name()}>"
}
}
return out.toString()
}
def body = """<html>
<p>
This is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>"""
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)
def out = new StringBuilder("")
page.childNodes().each {
out << processNode(it)
}
println out.toString()
""