regex catching multiline script tag inside html

Question

I need to grab inline script tags inside HTML pages. The regex will eventually be driven from C#; for now I am testing it in Expresso.

The best I have so far is:

.*<script.*\r\n(.*\r\n)*\s*</script>

i.e.

.*<script catch the script tag
.*\r\n catch anything till the end of line
(.*\r\n)* catch other lines of the script
\s*</script> catch the closing script, with any indentation before

It grabs ALL the stuff between the first opening tag and the last closing tag, including HTML and other script tags.

You're having a problem parsing HTML with a regular expression? [Colour me surprised](http://stackoverflow.com/a/1732454/424509)! — CanSpice, Mar 23 '12 at 17:30
If you're going to use this in C# give this a try http://htmlagilitypack.codeplex.com/ — Stephen Gilboy, Mar 23 '12 at 17:40
@CanSpice - I thought the popularity of that post would have put an end to "Can I regex my HTML" questions. Sadly: no. — David, Mar 23 '12 at 18:45
Looking back, I found this [interesting post](http://stackoverflow.com/questions/542194/c-sharp-is-there-a-linq-to-html-or-some-other-good-net-html-manipulation-api). — Alberto De Caro, Jun 29 '12 at 15:51

score 4 · Accepted Answer · edited May 23 '17 at 12:20

Two scripts on the same line will break your regex. Try it on the source of the page with your question.

Parsing HTML with regex is not a very good idea (a link in the comments on your question explains why the <center> cannot hold); use an HTML parser instead.

The following snippet selects the <script> nodes using HtmlAgilityPack:

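// Requires: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html); // html holds the markup as a string; use doc.Load(path) to read from a file
var scripts = doc.DocumentNode.SelectNodes("//script"); // null when the page has no <script> nodes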

Isn't this simpler than regex?

Fuhrmanator · Answer 2 · 2012-03-23T18:34:15.147

How about enabling "dot matches all" and using something simple:

<script\b[^>]*>(.*?)</script>

Remember that matching is not the same as capturing. This should capture ($1) what's in between the tags. I did a quick test using http://regexpal.com/

Using bosinski.com/regex in Eclipse (I know it's not C#) here's my test file (followed by results):

<html>
<SCRIPT LANGUAGE="JavaScript"><!--
function demoMatchClick() {
  var re = new RegExp(document.demoMatch.regex.value);
  if (document.demoMatch.subject.value.match(re)) {
    alert("Successful match");
  } else {
    alert("No match");
  }
}
// -->
</SCRIPT>
<script language="fred">
this is the second set of code
</script>
</html>

Results of the regex match:

Found 2 match(es):

start=8, end=275
Group(0) = <SCRIPT LANGUAGE="JavaScript"><!--
function demoMatchClick() {
  var re = new RegExp(document.demoMatch.regex.value);
  if (document.demoMatch.subject.value.match(re)) {
    alert("Successful match");
  } else {
    alert("No match");
  }
}
// -->
</SCRIPT>
Group(1) = <!--
function demoMatchClick() {
  var re = new RegExp(document.demoMatch.regex.value);
  if (document.demoMatch.subject.value.match(re)) {
    alert("Successful match");
  } else {
    alert("No match");
  }
}
// -->

start=277, end=344
Group(0) = <script language="fred">
this is the second set of code
</script>
Group(1) = 
this is the second set of code
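
Since the question targets C#, here is a minimal sketch of the same pattern driven from .NET. "Dot matches all" corresponds to RegexOptions.Singleline there. RegexOptions.IgnoreCase is an assumption, inferred from the uppercase <SCRIPT> block matching in the results above. The class name, the ExtractBodies helper and the shortened sample input are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ScriptTagRegex
{
    // Singleline is .NET's "dot matches all"; IgnoreCase covers <SCRIPT> as well as <script>.
    static readonly Regex Pattern = new Regex(
        @"<script\b[^>]*>(.*?)</script>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the text captured between each pair of tags.
    public static List<string> ExtractBodies(string html)
    {
        var bodies = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            // Groups[0] is the whole match, Groups[1] is the capture ($1).
            bodies.Add(m.Groups[1].Value);
        }
        return bodies;
    }

    static void Main()
    {
        // Shortened, made-up stand-in for the test file above.
        string html =
            "<html>\r\n<SCRIPT LANGUAGE=\"JavaScript\"><!--\r\nalert(1);\r\n// -->\r\n</SCRIPT>\r\n" +
            "<script language=\"fred\">\r\nthis is the second set of code\r\n</script>\r\n</html>";

        foreach (string body in ExtractBodies(html))
        {
            Console.WriteLine("Captured: [" + body.Trim() + "]");
        }
    }
}
```

The lazy .*? stops at the first closing tag it meets, so each script block becomes its own match instead of one match running to the last </script>.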

score 1 · Answer 3 · answered Mar 23 '12 at 17:56

Depending on whom you ask, you have one of two problems: either you are using regex on HTML, or your quantifiers are too greedy.

I don't know what problem you are trying to solve, but chances are good that the right solution is an HTML parser.

If you want to stick with regex, use the lazy (ungreedy) version of the quantifier, *?. Your regex would then look something like this:

.*<script.*\r\n(.*\r\n)*?\s*</script>

That means it matches as few lines as needed to reach the first closing tag.
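
To see the difference, here is a small C# sketch that runs both the question's pattern and the lazy variant over the same input. The class name, the CountMatches helper and the sample HTML are invented for this illustration, and the input assumes \r\n line endings, as both patterns do.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    // The question's pattern: the quantifier on the line group is greedy.
    public const string Greedy = @".*<script.*\r\n(.*\r\n)*\s*</script>";

    // The same pattern with the lazy quantifier *? on the line group.
    public const string Lazy = @".*<script.*\r\n(.*\r\n)*?\s*</script>";

    // Hypothetical helper: counts how many separate matches a pattern produces.
    public static int CountMatches(string pattern, string html)
    {
        return Regex.Matches(html, pattern).Count;
    }

    static void Main()
    {
        // Two script blocks with ordinary HTML between them.
        string html =
            "<script>\r\na();\r\n</script>\r\n" +
            "<p>some html</p>\r\n" +
            "<script>\r\nb();\r\n</script>\r\n";

        Console.WriteLine("greedy matches: " + CountMatches(Greedy, html));
        Console.WriteLine("lazy matches:   " + CountMatches(Lazy, html));
    }
}
```

On this input the greedy pattern should report a single match that swallows both blocks and the paragraph between them, while the lazy one should report two. Both variants still require a line break after the opening tag and the closing tag at the start of its own line.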

score 0 · Answer 4 · answered Dec 08 '14 at 04:55

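Try this:

<(?<tag>script)[^>]*>(?<content>.*?)<\/\k<tag>>

Replace the word script inside the tag group with another element name and you can use it for other elements too. For scripts that span several lines, . has to match newlines as well (RegexOptions.Singleline in .NET).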
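
As a rough sketch of how the named groups can be read from C#, the helper below builds the pattern for an arbitrary element name and returns each content group. The class and method names are hypothetical. Regex.Escape, the \b after the element name and the two RegexOptions flags are additions made for this example, not part of the pattern as posted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamedGroupExtractor
{
    // Hypothetical helper: applies the named-group pattern to any element name.
    public static List<string> ExtractContents(string html, string tagName)
    {
        // \b is an addition for this sketch; it keeps "p" from also matching <pre>.
        string pattern =
            "<(?<tag>" + Regex.Escape(tagName) + @")\b[^>]*>(?<content>.*?)</\k<tag>>";

        var contents = new List<string>();
        // Singleline lets '.' cross line breaks; IgnoreCase accepts <SCRIPT> too.
        foreach (Match m in Regex.Matches(html, pattern,
                     RegexOptions.Singleline | RegexOptions.IgnoreCase))
        {
            // Named groups are read by name rather than by index.
            contents.Add(m.Groups["content"].Value);
        }
        return contents;
    }

    static void Main()
    {
        // Made-up sample input.
        string html = "<script type=\"text/javascript\">\r\nvar x = 1;\r\n</script>\r\n<p>text</p>";

        foreach (string content in ExtractContents(html, "script"))
        {
            Console.WriteLine(content.Trim());
        }
    }
}
```

The \k<tag> backreference makes the closing tag repeat whatever the tag group captured, which is why only the element name has to change.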

Jani Hyytiäinen