
I'm using an external API that only allows regexes. I wanted to extract the content of an XML tag such as <name>alwin</name>, so I used <.*?>.*?<.*/> to get "alwin", but it doesn't work. The content is now structured like <name><![CDATA[<table>alwin</table>]]</name>, and I want to handle the CDATA too: I want to extract <![CDATA[<table>alwin</table>]] as well as just "alwin".

Phoenix
  • 3
    See [this SO answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Jim Garrison Apr 11 '13 at 06:32

1 Answer


Try this pattern instead:

<([a-zA-Z]+).*?>(.*?)</\1>

The \1 backreference refers to the first capturing group of the pattern, i.e. ([a-zA-Z]+). The closing tag must therefore have the same name as the opening one.

The content of the tag will then be available in the second group:

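import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 captures the tag name; group 2 captures the tag content.
Pattern p = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");
Matcher m = p.matcher("<name><![CDATA[<table>alwin</table>]]</name>");
while (m.find()) {
    System.out.println(m.group(2));
}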

The above snippet prints:

<![CDATA[<table>alwin</table>]]

Apply the same pattern again to the above output to get the alwin part.
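As a minimal sketch of that two-pass idea, the class below runs the same pattern first on the original input and then on each extracted content string. The class name `TagContentExtractor` and the helper `contents` are hypothetical names chosen for illustration; only the pattern and the sample input come from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class TagContentExtractor {
    // Pattern from the answer: group 1 is the tag name, group 2 is the content.
    private static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Returns the content (group 2) of every match found in the input.
    static List<String> contents(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<name><![CDATA[<table>alwin</table>]]</name>";
        // First pass: content of <name>, i.e. the CDATA part.
        for (String outer : contents(xml)) {
            System.out.println(outer);
            // Second pass: content of the tag nested inside the CDATA part.
            for (String inner : contents(outer)) {
                System.out.println(inner);
            }
        }
    }
}
```

For the sample input, the first pass yields the CDATA part and the second pass yields alwin. Note that `.` does not match line terminators by default, so content spanning several lines would need `Pattern.DOTALL`.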

sp00m
  • 1
    This approach will fail when the CDATA section itself contains a closing tag with the same name as the outer element, because the pattern will match the outer opening tag against the closing tag inside the CDATA. It will also fail for ordinarily nested identical tags, even ignoring the CDATA issue. – drquicksilver Apr 12 '13 at 14:02
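
The two limitations raised in the comment above can be reproduced with a short sketch. The inputs are made-up examples, and `TagPatternLimits` and `firstContent` are hypothetical names; the pattern is the one from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class TagPatternLimits {
    private static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Returns group 2 of the first match, or null if nothing matches.
    static String firstContent(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // A closing </name> inside the CDATA ends the match too early.
        System.out.println(firstContent("<name><![CDATA[</name>]]></name>"));
        // With nested identical tags, the lazy match stops at the first </a>.
        System.out.println(firstContent("<a><a>x</a></a>"));
    }
}
```

In both cases the lazy `(.*?)` stops at the first closing tag with a matching name, so the captured content is cut short. Handling these inputs reliably generally calls for an XML parser rather than a regex.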