
I am having trouble matching every <script> tag and its respective closing </script> tag in an HTML text using regular expressions in C#.

I created a sample html that looks like:

<html>
<head>
<title>
</title>

<script src="adasdsadsda.js"></script>
</head>

<body>
    <script type='javascript'>
        var a = 1 + 2;

        alert('a');
    </script>
</body>

<script></script>
</html>

The regular expression I am using is:

<script.*>[^>]*<\/script>

I often use regexr to validate/test my regular expressions (highly recommend it!). It shows the regular expression in question captures 3 occurrences (just as I expect).

But C#'s Regex.Matches does not return 3 matches. Instead, it returns a single match containing all three occurrences. Is this the expected behavior for the Matches method? I have used it quite a lot and have always gotten each occurrence as a separate match.

Why is this happening in my case?

P.S.: When answering, if you want to point out that regex is not suited for parsing HTML, please also explain why regexr and .NET's Regex give different results. Do they have different regex implementations?

Veverke
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Learner Nov 08 '15 at 09:31
  • @SIslam: thanks, but this then means that C#'s Regex implementation is not the same as, say, [regexr](http://www.regexr.com) ? Weird – Veverke Nov 08 '15 at 09:33
  • Ah! I mean do not parse HTML with <> – Learner Nov 08 '15 at 09:36
  • Please see my update in bold ;) – Veverke Nov 08 '15 at 09:36
  • Yes, they're different flavors. RegExr uses your browser's RegExp engine for matching. Use a .net tester instead (http://regexhero.net/tester/ or http://regexstorm.net/tester). However, .net **[also returns the same 3 matches](http://ideone.com/39gZvN)**. That said, if you have a `>` sign in your JavaScript code, it would fail... Don't use regex to parse HTML, [You can use the HTML Agility Pack](http://stackoverflow.com/a/847051/5290909) – Mariano Nov 08 '15 at 09:50
  • @Mariano: thanks. I actually moved from Agility Pack to regex because I had the impression it was not working. Will try it again. Thanks for the other directions, will try them as well. Please rewrite your comment as an answer so I can give you some points for helping. – Veverke Nov 08 '15 at 09:52

2 Answers


RegExr uses your browser's RegExp engine for matching, which implements a different regex flavor.

.NET uses its own regex flavor, so I'd suggest using a .NET online tester instead. For example: http://regexhero.net/tester/ or http://regexstorm.net/tester

However, the pattern <script.*>[^>]*<\/script> should return the same matched text in almost all flavors.

Code

using System;
using System.Text.RegularExpressions;

string pattern = @"<script.*>[^>]*<\/script>";
var re = new Regex(pattern);
var text = @"
        <html>
        <head>
        <title>
        </title>

        <script src=""adasdsadsda.js""></script>
        </head>

        <body>
            <script type='javascript'>
                var a = 1 + 2;

                alert('a');
            </script>
        </body>

        <script></script>
        </html>
    ";


MatchCollection matches = re.Matches(text);
for (int mnum = 0; mnum < matches.Count; mnum++)
{   //loop matches
    Match match = matches[mnum];
    Console.WriteLine("Match #{0} - Value: {1}", mnum + 1, match.Value);
}

Output

Match #1 - Value: <script src="adasdsadsda.js"></script>
Match #2 - Value: <script type='javascript'>
                        var a = 1 + 2;

                        alert('a');
                    </script>
Match #3 - Value: <script></script>

ideone demo: http://ideone.com/39gZvN


That said, if you have a > sign in your JavaScript code (as part of an IF condition or in a string), it would fail.
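To make the failure mode concrete, here is a minimal sketch. It also shows one possible way to end up with a single match: the question does not say which options were passed, so the use of RegexOptions.Singleline here is purely an assumption. The class name, the shortened sample HTML, and the alternative "Lazy" pattern are hypothetical and for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptRegexDemo
{
    // The pattern from the question.
    public const string Original = @"<script.*>[^>]*<\/script>";

    // Hypothetical alternative: the opening tag may not contain '>',
    // and the body is matched lazily up to the first closing tag.
    public const string Lazy = @"<script[^>]*>.*?<\/script>";

    public static int Count(string html, string pattern, RegexOptions options = RegexOptions.None)
    {
        return Regex.Matches(html, pattern, options).Count;
    }

    public static void Main()
    {
        // Shortened version of the sample HTML, using "\n" line breaks.
        string plain = "<script src=\"a.js\"></script>\n<body>\n<script>\nvar a = 1 + 2;\n</script>\n</body>\n<script></script>";

        // Same document, but the inline script now contains a '>' character.
        string withGt = plain.Replace("var a = 1 + 2;", "if (a > 1) alert('a');");

        // By default '.' does not match "\n", so ".*" stays on one line.
        Console.WriteLine(Count(plain, Original));                          // 3

        // With Singleline, the greedy ".*" runs to the end of the input and
        // backtracks only as far as needed: one match covering everything.
        Console.WriteLine(Count(plain, Original, RegexOptions.Singleline)); // 1

        // "[^>]*" cannot cross the '>' in the script body, so that block is missed.
        Console.WriteLine(Count(withGt, Original));                         // 2

        // The lazy variant needs Singleline so ".*?" can span line breaks.
        Console.WriteLine(Count(withGt, Lazy, RegexOptions.Singleline));    // 3
    }
}
```

The lazy variant is still only a pattern match, not a parser: it knows nothing about comments, CDATA sections, or malformed markup, so it does not change the advice that follows.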

There are many reasons not to parse HTML with regex, so please take the following advice: don't use regex. Instead, you can use the HTML Agility Pack. Edit: Instead, I recommend using an HTML parser.

Mariano
  • Yes, I am aware of the "if you have a `>` sign in your JavaScript code" caveat (which means one should not rely on such a regex). – Veverke Nov 08 '15 at 10:07
  • I'm following your advice and have moved back to testing Agility Pack again :-) – Veverke Nov 08 '15 at 10:08
  • As HTML Agility Pack is neither maintained nor HTML5 compatible, I would highly suggest using a better solution. There are plenty (for some, see the answer of @Veverke). – Florian Rappl Nov 09 '15 at 08:33

I am marking Mariano's answer as the solution, but am leaving here the outcome of further research, which is not mentioned in the selected answer:

It seems the most popular options are, in order of popularity, the following NuGet packages:

  • Html Agility Pack
  • CsQuery
  • AngleSharp

I ended up using AngleSharp, which has the advantage over CsQuery of still being maintained/developed.
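For reference, a minimal sketch of what collecting the script elements with AngleSharp might look like. This is not the code I actually used: it assumes the AngleSharp NuGet package in a version that exposes HtmlParser.ParseDocument (older releases used a different namespace and method name), and the class name, helper method, and sample HTML are made up for illustration.

```csharp
using System;
using AngleSharp.Html.Parser;

public static class ScriptTagDemo
{
    // Hypothetical helper: parses the HTML and counts its <script> elements.
    public static int CountScripts(string html)
    {
        var document = new HtmlParser().ParseDocument(html);
        return document.QuerySelectorAll("script").Length;
    }

    public static void Main()
    {
        // Sample input; the inline script contains a '>' on purpose.
        string html = "<html><head><script src=\"adasdsadsda.js\"></script></head>"
                    + "<body><script>if (a > 1) alert('a');</script></body></html>";

        var document = new HtmlParser().ParseDocument(html);

        // The parser, not a pattern, decides where each element starts and ends.
        foreach (var script in document.QuerySelectorAll("script"))
        {
            string src = script.GetAttribute("src") ?? "(inline)";
            Console.WriteLine("src: {0}, body: {1}", src, script.TextContent);
        }
    }
}
```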

Veverke
  • Thanks for contributing with your chosen package. I never used the latter and I'll take a look at it. – Mariano Nov 08 '15 at 12:03