I need a regex that matches CDATA elements in HTML

Question

I'm trying to write a regular expression to match CDATA elements in HTML, in a web crawler class in C#.

What I have used in the past is: \<\!\[CDATA\[(?<text>[^\]]*)\]\]\> , but this breaks when the CDATA section contains JavaScript that uses array brackets ([]), because the negated class [^\]] stops at the first ]. I use a negated class rather than a greedy match because there can be several CDATA sections and I want to match each of them separately.

If I modify the regex to stop at the closing '>' character instead, I have the same problem: any JavaScript with a > operator breaks my regex.

So I need to use a negative look-ahead within this regex to ignore ']]>'. How would I write this?

Here's some test data for a quick setup of the problem:

        // Matches any CDATA section whose content contains no ']'
        string pattern = @"\<\!\[CDATA\[(?<text>[^\]]*)\]\]\>";
        var rx = new Regex(pattern, RegexOptions.Singleline);

        /* Testing...*/

         string eg = @"<![CDATA[TesteyMcTest//]]><![CDATA[TesteyMcTest2//]]><![CDATA[TesteyMcTest//]]><!             [CDATA[TesteyMcTest2//]]>
         <![CDATA[Thisisal3ongarbi4trarys6testwithnumbers//]]><![CDATA             [thisisalo4ngarbitrarytest6withumbers123456//]]><![CDATA[ this.exec = (function(){ var x =              this.GetFakeArray(); var y = x[0]; return y > 3;});//]]> ";

         var mz = rx.Matches(eg);

This example matches every CDATA section except the last one, which contains JavaScript with ']' and '>' characters.

Thanks in advance,

See this link: http://stackoverflow.com/questions/4616554/what-is-the-regex-expression-for-cdata – Anurag Jain, Feb 10 '14 at 15:59

score 2 · Accepted Answer · answered Feb 10 '14 at 16:32

The problem is that your <text> subpattern is wrong. You don't need to avoid every ]; you only need to avoid a ] that is followed by ]>. You can use this subpattern instead:

(?<text>(?>[^]]+|](?!]>))*)

The whole pattern (note that many characters don't need to be escaped):

@"<!\s*\[CDATA\s*\[(?<text>(?>[^]]+|](?!]>))*)]]>"

I added two \s* to match all your example strings, but if you want to disallow these optional spaces, you can remove the \s*.
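As a minimal sketch of how this pattern could be used from C#: the class name CDataExtractor and the method Extract are hypothetical names chosen for illustration, and the brackets are written escaped here (\] instead of a bare ]), which should be equivalent but is easier to read. Because the negated class [^\]] also matches newlines, RegexOptions.Singleline is not required for this pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answer.
public static class CDataExtractor
{
    // (?>...) is an atomic group. Inside it, either a run of non-']' characters
    // is consumed, or a single ']' that is not the start of the "]]>" terminator.
    private static readonly Regex CDataRx = new Regex(
        @"<!\s*\[CDATA\s*\[(?<text>(?>[^\]]+|\](?!\]>))*)\]\]>");

    // Returns the content of every CDATA section found in the input.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in CDataRx.Matches(html))
        {
            results.Add(m.Groups["text"].Value);
        }
        return results;
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened test data: the second section contains ']' and '>'.
        string eg = "<![CDATA[TesteyMcTest//]]>" +
                    "<![CDATA[ var y = x[0]; return y > 3;//]]>";

        foreach (string text in CDataExtractor.Extract(eg))
        {
            Console.WriteLine(text);
        }
    }
}
```

With this input, both sections should be reported, including the one whose JavaScript contains x[0] and the > operator.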

score 0 · Answer 2 · edited May 23 '17 at 10:33

Does the following work for you? http://regex101.com/r/cT0pT0

\[CDATA\[(.*?)\]\]>

It seems to match what you are asking for. The key is that .*? (a non-greedy match) stops at the first ]]> it encounters.
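A short sketch of the non-greedy approach in C#, for comparison. The class and method names (LazyCDataDemo, ExtractLazy) are hypothetical. RegexOptions.Singleline is assumed here, as in the question's code, so that '.' also matches newlines inside multi-line CDATA content.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class LazyCDataDemo
{
    // .*? expands one character at a time and stops at the first "]]>".
    private static readonly Regex LazyRx =
        new Regex(@"\[CDATA\[(.*?)\]\]>", RegexOptions.Singleline);

    public static List<string> ExtractLazy(string html)
    {
        var results = new List<string>();
        foreach (Match m in LazyRx.Matches(html))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        string input = "<![CDATA[first//]]>" +
                       "<![CDATA[ var y = x[0]; return y > 3;//]]>";

        foreach (string text in ExtractLazy(input))
        {
            Console.WriteLine(text);
        }
    }
}
```

Note that this pattern, as written, does not allow whitespace between CDATA and the following [.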

NOTE - it is usually a REALLY BAD IDEA to use regex for parsing HTML. There are plenty of good libraries available to do the job far more robustly.

See for example "What is the best way to parse html in C#?"

answered Feb 10 '14 at 16:36

Floris

Right. I'm not using regex alone to parse the HTML document. I'm using Html Agility Pack to do the heavy lifting, which leaves me with some CDATA elements to clean up. – David Dworetzky Feb 10 '14 at 17:34
@DavidDworetzky Good to hear it. Does my expression work for you? If it doesn't (it seems to work on the test cases you gave), can you expand your test cases? – Floris Feb 10 '14 at 17:38
