This regex doesn't work for CDATA in xml. How do I fix this?

Question

I'm using an external API that only lets me use regexes. I wanted to extract the content of an XML tag like <name>alwin</name>, and I used <.*?>.*?<.*/> to get "alwin", and it doesn't work. The data is now structured like <name><![CDATA[<table>alwin</table>]]</name>, and I want to be able to parse the CDATA too: I want to extract <![CDATA[<table>alwin</table>]] as well as just "alwin".

See [this SO answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Jim Garrison, Apr 11 '13 at 06:32

score 0 · Answer 1 · answered Apr 11 '13 at 07:42

Try with this pattern instead:

<([a-zA-Z]+).*?>(.*?)</\1>

The \1 backreference refers to the first capturing group of the pattern, i.e. ([a-zA-Z]+). Thus, the matched closing tag has to repeat the name captured from the opening one.
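
As a quick check of how the backreference behaves, here is a minimal sketch that runs the pattern above against a well-paired and a mismatched tag. The class name `BackrefDemo` and the helper `hasMatchingTag` are illustrative names, not part of the original answer.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // Same pattern as in the answer: \1 must repeat whatever group 1 captured.
    static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Hypothetical helper: true if the input contains an opening tag
    // followed by a closing tag that satisfies the backreference.
    static boolean hasMatchingTag(String input) {
        return TAG.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasMatchingTag("<name>alwin</name>"));  // true
        System.out.println(hasMatchingTag("<name>alwin</title>")); // false
    }
}
```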

The content of the tag will then be available in the second group:

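import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 captures the tag name; group 2 captures the tag content.
Pattern p = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");
Matcher m = p.matcher("<name><![CDATA[<table>alwin</table>]]</name>");
while (m.find()) {
    System.out.println(m.group(2)); // text between the opening and closing tag
}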

The above snippet prints:

<![CDATA[<table>alwin</table>]]

Apply the pattern again to the above output to get the alwin part.
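
The two passes could be sketched roughly as follows, using the sample input from the question as written. The class `TwoPassExtract` and the helper `firstTagContent` are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassExtract {
    // Pattern from the answer: group 1 is the tag name, group 2 the content.
    static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Hypothetical helper: content of the first matched tag, or null if
    // the input is null or contains no match.
    static String firstTagContent(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String xml = "<name><![CDATA[<table>alwin</table>]]</name>";
        String cdata = firstTagContent(xml);   // first pass: content of <name>
        String text = firstTagContent(cdata);  // second pass: content of <table>
        System.out.println(cdata); // <![CDATA[<table>alwin</table>]]
        System.out.println(text);  // alwin
    }
}
```

The second pass works here because `<![CDATA[` cannot start a match (the character after `<` is not a letter), so the first tag found is `<table>`.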

This approach will fail when a `<![CDATA[]]>` section contains the same closing tag, because it will match the outer opening tag against the closing tag inside the CDATA. It will also fail for ordinarily nested identical tags, even ignoring the CDATA issue. — drquicksilver, Apr 12 '13 at 14:02
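
Both failure cases can be reproduced with a small sketch. The inputs below are made-up examples, and `CDATA_AWARE` is only one possible workaround, not something proposed in the thread: it assumes the tag body is exactly one well-formed CDATA section terminated by `]]>`, and it still does not handle nested identical tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CdataCaveat {
    // Pattern from the answer.
    static final Pattern NAIVE = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Assumed alternative: the tag body must be a single CDATA section
    // that ends with "]]>" immediately before the closing tag.
    static final Pattern CDATA_AWARE = Pattern.compile(
            "<([a-zA-Z]+)[^>]*><!\\[CDATA\\[(.*?)\\]\\]></\\1>", Pattern.DOTALL);

    // Hypothetical helper: group 2 of the first match, or null if none.
    static String group2(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String tricky = "<a><![CDATA[</a>]]></a>";
        // The naive pattern stops at the </a> inside the CDATA section.
        System.out.println(group2(NAIVE, tricky));            // <![CDATA[
        // The CDATA-aware pattern only accepts a closing tag after "]]>".
        System.out.println(group2(CDATA_AWARE, tricky));      // </a>
        // Nested identical tags: the first </a> closes the match too early.
        System.out.println(group2(NAIVE, "<a><a>x</a></a>")); // <a>x
    }
}
```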
