I'm trying to write a regular expression to match CDATA sections in HTML, in a web crawler class in C#.
What I have used in the past is: \<\!\[CDATA\[(?<text>[^\]]*)\]\]\>
The problem is that this breaks when the CDATA section contains JavaScript that uses array brackets, because [^\]]* stops at the first ']'. The negated character class is there because, if there are multiple CDATA sections, I want to match each one separately rather than get one greedy match spanning all of them.
If I modify the regex to stop at the closing '>' character instead, I have the same problem: any JavaScript containing a > operator breaks the regex.
So I think I need a negative look-ahead within this regex, so that the captured text runs up to the full ']]>' terminator rather than stopping at a lone ']' or '>'. How would I write this?
Here's some test data for a quick setup of the problem:
//Matches any
string pattern = @"\<\!\[CDATA\[(?<text>[^\]]*)\]\]\>";
var rx = new Regex(pattern, RegexOptions.Singleline);
/* Testing...*/
string eg = @"<![CDATA[TesteyMcTest//]]><![CDATA[TesteyMcTest2//]]><![CDATA[TesteyMcTest//]]><![CDATA[TesteyMcTest2//]]>
<![CDATA[Thisisal3ongarbi4trarys6testwithnumbers//]]><![CDATA[thisisalo4ngarbitrarytest6withumbers123456//]]><![CDATA[ this.exec = (function(){ var x = this.GetFakeArray(); var y = x[0]; return y > 3;});//]]> ";
var mz = rx.Matches(eg);
This example matches every CDATA section except the last one, which contains JavaScript with ']' and '>' characters.
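For reference, here is a minimal sketch of the kind of pattern I have in mind, assuming a negative look-ahead applied before each character, (?:(?!\]\]>).)*, is the right tool. The class name CdataSketch, the helper Extract, and the shortened sample input are made up for illustration and are not part of my crawler.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CdataSketch
{
    // (?:(?!\]\]>).)* consumes one character at a time, but only while
    // the text at the current position does not start with "]]>".
    // A lone ']' or '>' inside the section is therefore allowed.
    private const string Pattern = @"<!\[CDATA\[(?<text>(?:(?!\]\]>).)*)\]\]>";

    // Hypothetical helper: returns the captured text of every CDATA section.
    public static List<string> Extract(string input)
    {
        // Singleline lets '.' match newlines, so sections may span lines.
        var rx = new Regex(Pattern, RegexOptions.Singleline);
        var results = new List<string>();
        foreach (Match m in rx.Matches(input))
        {
            results.Add(m.Groups["text"].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Shortened, made-up sample: one plain section, one with ']' and '>'.
        string eg = "<![CDATA[TesteyMcTest//]]><![CDATA[ var y = x[0]; return y > 3;//]]>";
        foreach (string text in Extract(eg))
        {
            Console.WriteLine(text);
        }
    }
}
```

As far as I can tell, a lazy quantifier, (?<text>.*?), with RegexOptions.Singleline would produce the same matches on this input, since it also stops at the first ']]>' that lets the overall pattern succeed.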
Thanks in advance,