I need to grab inline script tags inside HTML pages. The regex will eventually be driven from C#; for now I am testing it in Expresso.

The following is the best I have so far:

.*<script.*\r\n(.*\r\n)*\s*</script>

i.e.

  • .*<script matches the opening script tag
  • .*\r\n matches anything up to the end of the line
  • (.*\r\n)* matches the remaining lines of the script
  • \s*</script> matches the closing script tag, with any indentation before it

It grabs ALL the stuff between the first opening tag and the last closing tag, including HTML and other script tags.

Alberto De Caro
  • You're having a problem parsing HTML with a regular expression? [Colour me surprised](http://stackoverflow.com/a/1732454/424509)! – CanSpice Mar 23 '12 at 17:30
  • If you're going to use this in C#, give this a try: http://htmlagilitypack.codeplex.com/ – Stephen Gilboy Mar 23 '12 at 17:40
  • @CanSpice - I thought the popularity of that post would have put an end to "Can I regex my HTML" questions. Sadly: no. – David Mar 23 '12 at 18:45
  • Looking back, I found this [interesting post](http://stackoverflow.com/questions/542194/c-sharp-is-there-a-linq-to-html-or-some-other-good-net-html-manipulation-api). – Alberto De Caro Jun 29 '12 at 15:51

4 Answers

Two scripts on the same line will break your regex. Try it on the source of the page with your question.
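Here is a minimal C# sketch of one way this shows up. The one-line input is made up for illustration (it is not the real page source), and the pattern is the one from the question:

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
string pattern = @".*<script.*\r\n(.*\r\n)*\s*</script>";

// Hypothetical input: two inline scripts on one line, with no line break at all.
string oneLine = "<p>hi</p><script>a();</script><script>b();</script>";

// The pattern insists on a \r\n after the opening tag, so nothing can match here.
int found = Regex.Matches(oneLine, pattern).Count;
Console.WriteLine(found); // prints 0
```

Neither script is found, because the pattern assumes every script block spans at least two lines.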

Parsing HTML with regex is not a very good idea (a link in the comments on your question explains why the <center> cannot hold); use an HTML parser instead.

The following snippet selects the <script> nodes using HtmlAgilityPack:

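// Requires: using HtmlAgilityPack;
var doc = new HtmlDocument();
// Load(path) reads a file; call doc.LoadHtml(html) if html holds the markup itself.
doc.Load(html);
// SelectNodes returns null (not an empty list) when no node matches.
var scripts = doc.DocumentNode.SelectNodes("//script");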

Isn't this simpler than regex?

Oleks

How about enabling "dot matches all" and using something simple:

<script\b[^>]*>(.*?)</script>

Remember that matching is not the same as capturing. This should capture ($1) what's in between the tags. I did a quick test using http://regexpal.com/
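Since the regex will eventually be driven from C#, here is a rough sketch of how that pattern might be used there. In .NET, "dot matches all" is spelled RegexOptions.Singleline. The sample page and the IgnoreCase option are assumptions of this sketch, not part of the pattern above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample page with one upper-case and one multi-line script block.
string html =
    "<html>\r\n" +
    "<SCRIPT LANGUAGE=\"JavaScript\">alert(1);</SCRIPT>\r\n" +
    "<script language=\"fred\">\r\nsecond block\r\n</script>\r\n" +
    "</html>";

// Singleline is .NET's name for "dot matches all"; IgnoreCase is an added
// assumption so that <SCRIPT> and <script> are treated alike.
var regex = new Regex(@"<script\b[^>]*>(.*?)</script>",
                      RegexOptions.Singleline | RegexOptions.IgnoreCase);

var bodies = new List<string>();
foreach (Match m in regex.Matches(html))
{
    // Groups[0] is the whole element; Groups[1] is only the text between the tags.
    bodies.Add(m.Groups[1].Value);
}

Console.WriteLine(bodies.Count); // prints 2
```

Groups[1] is the C# counterpart of $1, so bodies ends up holding just the script text without the surrounding tags.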

Using bosinski.com/regex in Eclipse (I know it's not C#), here's my test file, followed by the results:

<html>
<SCRIPT LANGUAGE="JavaScript"><!--
function demoMatchClick() {
  var re = new RegExp(document.demoMatch.regex.value);
  if (document.demoMatch.subject.value.match(re)) {
    alert("Successful match");
  } else {
    alert("No match");
  }
}
// -->
</SCRIPT>
<script language="fred">
this is the second set of code
</script>
</html>

Results of the regex match:

Found 2 match(es):

start=8, end=275
Group(0) = <SCRIPT LANGUAGE="JavaScript"><!--
function demoMatchClick() {
  var re = new RegExp(document.demoMatch.regex.value);
  if (document.demoMatch.subject.value.match(re)) {
    alert("Successful match");
  } else {
    alert("No match");
  }
}
// -->
</SCRIPT>
Group(1) = <!--
function demoMatchClick() {
  var re = new RegExp(document.demoMatch.regex.value);
  if (document.demoMatch.subject.value.match(re)) {
    alert("Successful match");
  } else {
    alert("No match");
  }
}
// -->

start=277, end=344
Group(0) = <script language="fred">
this is the second set of code
</script>
Group(1) = 
this is the second set of code
Fuhrmanator

Depending on whom you ask, you have one of two problems: either you are using regex on HTML, or your quantifiers are too greedy.

I don't know what problem you are trying to solve, but chances are good that the right solution is to use an HTML parser.

If you want to stick with regex, use the lazy (ungreedy) version of the quantifier, *?. Your regex would then look something like this:

.*<script.*\r\n(.*\r\n)*?\s*</script>

That means it matches as few lines as needed to reach the first closing tag.
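To see the difference from C#, here is a small sketch that runs the greedy pattern from the question and the lazy variant against the same text. The input is a made-up fragment with CRLF line endings, which both patterns assume:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input with CRLF line endings and two separate script blocks.
string html =
    "<script>\r\nfirst();\r\n</script>\r\n" +
    "<p>text</p>\r\n" +
    "<script>\r\nsecond();\r\n</script>\r\n";

string greedy = @".*<script.*\r\n(.*\r\n)*\s*</script>";  // pattern from the question
string lazy = @".*<script.*\r\n(.*\r\n)*?\s*</script>";   // same pattern with *? on the group

int greedyCount = Regex.Matches(html, greedy).Count;
int lazyCount = Regex.Matches(html, lazy).Count;

Console.WriteLine($"greedy: {greedyCount}, lazy: {lazyCount}"); // prints "greedy: 1, lazy: 2"
```

The greedy version swallows everything from the first opening tag to the last closing tag as a single match, while the lazy version stops at each closing tag and reports the two blocks separately.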

stema

Try this:

<(?<tag>script)\b[^>]*>(?<content>.*?)<\/\k<tag>>

Replace the word script after ?<tag> with another element name and you can use it for other elements too.
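A rough C# sketch of the named-group approach follows. The input is made up, and two details are assumptions of this sketch: the tag name is written as script\b so that longer names such as scripts are not matched, and RegexOptions.Singleline is enabled so that .*? can run across line breaks:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input; the second script block spans several lines.
string html =
    "<script type=\"text/javascript\">one();</script>\n" +
    "<div>not a script</div>\n" +
    "<script>\ntwo();\n</script>";

// (?<tag>...) names the captured element name and \k<tag> requires the closing
// tag to repeat it. Singleline lets .*? cross line breaks.
var regex = new Regex(@"<(?<tag>script)\b[^>]*>(?<content>.*?)</\k<tag>>",
                      RegexOptions.Singleline);

MatchCollection matches = regex.Matches(html);
foreach (Match m in matches)
{
    // Named groups are read by name instead of by index.
    Console.WriteLine(m.Groups["content"].Value.Trim());
}
```

Swapping script for another element name in the pattern is the only change needed to target a different element.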

Jani Hyytiäinen