I'm using an external API that only lets me use regexes. I wanted to parse the content of an XML tag like `<name>alwin</name>`, and I used `<.*?>.*?<.*/>` to parse "alwin", and it doesn't work. Now the input is structured like `<name><![CDATA[<table>alwin</table>]]</name>`, and I want to be able to parse the CDATA too: I want to extract `<![CDATA[<table>alwin</table>]]` as well as just "alwin".

Phoenix
See [this SO answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Jim Garrison Apr 11 '13 at 06:32
1 Answer
Try with this pattern instead:
    <([a-zA-Z]+).*?>(.*?)</\1>

The `\1` backreference targets the first capturing group of the pattern, i.e. `([a-zA-Z]+)`. Thus, the matched closing tag will always be the same as the opening one.
The content of the tag will then be available in the second group:
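    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // \\1 refers back to group 1, so the closing tag name must equal the opening one
    Pattern p = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");
    Matcher m = p.matcher("<name><![CDATA[<table>alwin</table>]]</name>");
    while (m.find()) {
        System.out.println(m.group(2)); // group 2 holds the tag content
    }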
The above snippet prints:
    <![CDATA[<table>alwin</table>]]

Reapply the pattern to the above output to get the `alwin` part.
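As a minimal sketch of that second pass, the same pattern can be wrapped in a small helper and applied twice. The class name `TwoPassExtract` and the helper `firstTagContent` are made up for this illustration, and the input is the sample string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoPassExtract {
    // Same pattern as above: group 1 is the tag name, group 2 is the tag content.
    private static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Hypothetical helper: returns the content of the first matching tag pair,
    // or null when no pair is found.
    public static String firstTagContent(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String xml = "<name><![CDATA[<table>alwin</table>]]</name>";
        String outer = firstTagContent(xml);   // first pass: content of <name>
        String inner = firstTagContent(outer); // second pass: content of <table>
        System.out.println(outer); // <![CDATA[<table>alwin</table>]]
        System.out.println(inner); // alwin
    }
}
```

The second pass works here because `<![CDATA[` cannot start a match (the character after `<` is not a letter), so the first tag the matcher finds is `<table>`.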

sp00m
This approach will fail for input such as `<name><![CDATA[</name>]]></name>`, because it will match the outer `<name>` against the `</name>` inside the CDATA. It will also fail for ordinarily nested identical tags even ignoring the CDATA issue. – drquicksilver Apr 12 '13 at 14:02
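Both limitations from this comment can be reproduced with the pattern from the answer. The sketch below uses two made-up inputs (a CDATA section containing `</name>`, and nested `<div>` tags); the class name `FailureCases` and the helper `content` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FailureCases {
    // The pattern from the answer, unchanged.
    private static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Hypothetical helper: content (group 2) of the first match, or null if none.
    public static String content(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Case 1: the CDATA section contains a closing tag with the same name.
        // The lazy (.*?) stops at the first </name>, which is inside the CDATA.
        System.out.println(content("<name><![CDATA[</name>]]></name>")); // <![CDATA[

        // Case 2: nested identical tags. The match ends at the inner </div>,
        // so the captured content is cut short.
        System.out.println(content("<div>a<div>b</div>c</div>")); // a<div>b
    }
}
```

In both cases the lazy quantifier pairs the opening tag with the nearest closing tag of the same name, which is not necessarily the one that actually closes it.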