Why is C#'s Regex.Matches() returning all matches in a single Match object?

Question

I am having trouble extracting every <script> element, together with its closing </script> tag, from HTML text using regular expressions in C#.

I created a sample html that looks like:

<html>
<head>
<title>
</title>

<script src="adasdsadsda.js"></script>
</head>

<body>
    <script type='javascript'>
        var a = 1 + 2;

        alert('a');
    </script>
</body>

<script></script>
</html>

The regular expression I am using is:

<script.*>[^>]*<\/script>

I often use regexr to validate/test my regular expressions (highly recommend it!). It shows the regular expression in question captures 3 occurrences (just as I expect).

But C#'s Regex.Matches does not return 3 matches; instead, it returns a single match containing all occurrences. Is this the expected behavior of the Matches method? I have used it quite a lot and have always gotten each occurrence as a separate match.

Why is this happening in my case?

P.S.: If, in answering, you want to point out that regex is not suited for parsing HTML, please also explain why regexr and .NET's Regex give different results. Do they have different regex implementations?

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Learner, Nov 08 '15 at 09:31
@SIslam: thanks, but does this then mean that C#'s Regex implementation is not the same as, say, [regexr](http://www.regexr.com)? Weird — Veverke, Nov 08 '15 at 09:33
Yes, they're different flavors. RegExr uses your browser's RegExp engine for matching. Use a .net tester instead (http://regexhero.net/tester/ or http://regexstorm.net/tester). However, .net **[also returns the same 3 matches](http://ideone.com/39gZvN)**. That said, if you have a `>` sign in your JavaScript code, it would fail... Don't use regex to parse HTML, [You can use the HTML Agility Pack](http://stackoverflow.com/a/847051/5290909) — Mariano, Nov 08 '15 at 09:50
@Mariano: thanks. I actually moved from Agility Pack to regex because I had the impression it was not working. Will try it again. Thanks for the other directions, will try them as well. Please rewrite your comment as an answer so I can give you some points for helping. — Veverke, Nov 08 '15 at 09:52

score 1 · Accepted Answer · edited May 23 '17 at 11:51

RegExr uses your browser's RegExp engine for matching. It implements a different regex flavor.

.NET uses its own regex flavor, so I'd suggest using a .NET online tester instead, for example http://regexhero.net/tester/ or http://regexstorm.net/tester.

However, the pattern <script.*>[^>]*<\/script> should return the same matched text in almost all flavors.

Code

using System;
using System.Text.RegularExpressions;

string pattern = @"<script.*>[^>]*<\/script>";
var re = new Regex(pattern);
var text = @"
        <html>
        <head>
        <title>
        </title>

        <script src=""adasdsadsda.js""></script>
        </head>

        <body>
            <script type='javascript'>
                var a = 1 + 2;

                alert('a');
            </script>
        </body>

        <script></script>
        </html>
    ";


MatchCollection matches = re.Matches(text);
for (int mnum = 0; mnum < matches.Count; mnum++)
{   //loop matches
    Match match = matches[mnum];
    Console.WriteLine("Match #{0} - Value: {1}", mnum + 1, match.Value);
}

Output

Match #1 - Value: <script src="adasdsadsda.js"></script>
Match #2 - Value: <script type='javascript'>
                        var a = 1 + 2;

                        alert('a');
                    </script>
Match #3 - Value: <script></script>

ideone demo: http://ideone.com/39gZvN
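The thread never pins down what the asker's code did differently from the demo above. One way this exact pattern does collapse into a single match is when the regex is built with RegexOptions.Singleline: `.` then also matches newlines, so the greedy `.*` runs from the first <script to the last one. Whether the asker actually used that option is an assumption, not something the thread confirms. The sketch below only shows the difference in match counts; `SinglelineDemo` and `CountMatches` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same pattern and sample HTML as in the answer above.
    public const string Pattern = @"<script.*>[^>]*<\/script>";

    public const string Html =
        "<html>\n<head>\n<title>\n</title>\n\n" +
        "<script src=\"adasdsadsda.js\"></script>\n</head>\n\n" +
        "<body>\n    <script type='javascript'>\n        var a = 1 + 2;\n\n" +
        "        alert('a');\n    </script>\n</body>\n\n" +
        "<script></script>\n</html>\n";

    // Hypothetical helper: counts the matches produced under the given options.
    public static int CountMatches(RegexOptions options)
    {
        return new Regex(Pattern, options).Matches(Html).Count;
    }

    public static void Main()
    {
        // Default options: '.' stops at '\n', so each element matches separately.
        Console.WriteLine(CountMatches(RegexOptions.None));       // 3

        // Singleline (assumed cause): '.' crosses newlines, giving one long match.
        Console.WriteLine(CountMatches(RegexOptions.Singleline)); // 1
    }
}
```

If this turns out to be the cause, dropping the option is enough to get the three separate matches back with this sample.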

That said, if your JavaScript code contains a > sign (as part of an if condition or in a string), the pattern can fail, because [^>]* stops at the first > it meets.
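A minimal sketch of that failure, using the same pattern on two small hand-written script blocks. The inputs, `GreaterThanDemo` and `CountMatches` are hypothetical and only there for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreaterThanDemo
{
    // Same pattern as in the question.
    public const string Pattern = @"<script.*>[^>]*<\/script>";

    // Hypothetical helper: number of matches the pattern finds in the input.
    public static int CountMatches(string html)
    {
        return Regex.Matches(html, Pattern).Count;
    }

    public static void Main()
    {
        // Script body without a '>' character: the pattern matches.
        string plain = "<script>\n    var a = 1 + 2;\n</script>";

        // Script body with a '>' in an if condition: [^>]* stops at it,
        // "</script>" does not follow, and the whole match attempt fails.
        string withGreaterThan = "<script>\n    if (a > 1) { alert('a'); }\n</script>";

        Console.WriteLine(CountMatches(plain));           // 1
        Console.WriteLine(CountMatches(withGreaterThan)); // 0
    }
}
```

The result depends on layout: if the whole element sits on one line, the greedy `.*` can swallow the inner > and the pattern may still match. That sensitivity is part of why the answer recommends a parser.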

There are many reasons not to parse HTML with regex, so please take the following advice: don't use regex. ~~Instead, you can use the HTML Agility Pack.~~ Edit: Instead, I recommend using an HTML parser.

Yes, I am aware of the "if you have a `>` sign in your JavaScript code" issue (which means one should not rely on such a regex). — Veverke, Nov 08 '15 at 10:07
I'm taking your advice and have moved back to testing Agility Pack again :-) — Veverke, Nov 08 '15 at 10:08
As HTML Agility Pack is neither maintained nor HTML5 compatible, I would highly suggest using a better solution. There are plenty (for some, see the answer of @Veverke). — Florian Rappl, Nov 09 '15 at 08:33

score 1 · Answer 2 · Veverke · 2015-11-08T10:46:16.507

I am marking Mariano's answer as the solution, but am leaving here the outcome of further research, which is not mentioned in the selected answer:

The most popular options seem to be the following NuGet packages, in order of popularity:

Html Agility Pack
CsQuery
AngleSharp

I ended up using AngleSharp, which has the advantage over CsQuery of still being maintained/developed.

edited Nov 08 '15 at 10:46

answered Nov 08 '15 at 10:40


Thanks for contributing with your chosen package. I never used the latter and I'll take a look at it. – Mariano Nov 08 '15 at 12:03
