2

I don't know Regex very well, and I'm trying to get all of the script tags from some extracted page text. I've tried the following pattern:

<script.*?>.*?</script>

But this doesn't seem to return any script tag that has code inside it. That is, from the following:

<script type="text/javascript" src="Scripts/Scipt1.js"></script>
<script type="text/javascript" src="Scripts/Scipt2.js"></script>

<script type="text/javascript">
   function SomeMethod()
   {

   }
</script>

I'll only get the following results:

<script type="text/javascript" src="Scripts/Scipt1.js"></script>
<script type="text/javascript" src="Scripts/Scipt2.js"></script>

How can I return all 3? (NB. I do want to maintain the outer script tags in the results).

djdd87
  • 3
    Use an XML parser. Each time you parse XML with Regex, god kills a kitten. – scy Aug 12 '10 at 12:56
  • 2
    Please [don't](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an XML parser. – Stephen Aug 12 '10 at 12:57
  • 2
    You cannot reliably do this with Regex, as said many times here, HTML is not a regular language and therefore cannot be parsed with a Regular expression. You need an HTML parser. – Robin Day Aug 12 '10 at 12:57
  • And just for the sake of demonstrating this fact, your regex will kill too much of this: `

    Some text here.

    `
    – You Aug 12 '10 at 13:02
  • ...and too little of a script that contains "``". – You Aug 12 '10 at 13:04
  • @Scytale - Sod the kittens :) – djdd87 Aug 12 '10 at 13:28
  • @GenericTypeTea: I seem to be the only one who realises that your question is about regexes (specifically, why `.*?` didn’t do what you expected) and not really about parsing HTML. Perhaps in future questions, you can avoid receiving this flak by reducing the question to the essentials: in this case, you could have replaced “ – Timwi Aug 12 '10 at 13:35
  • @GenericTypeTea: Are you not interested in `` And if not try adding one at the top anyway and see what happens to your regex. – Robin Day Aug 12 '10 at 13:41
  • Also, don't forget to deal with scripts such as: `');` – Robin Day Aug 12 '10 at 13:43
  • @Robin - No I'm not. I actually really only care about ` – djdd87 Aug 12 '10 at 13:44
  • 2
    @Robin Day: Both of your examples (`` and using `` inside a script) are invalid HTML 4.01. – Timwi Aug 12 '10 at 14:00
  • @Timwi: Just because they're invalid HTML, people will still write it, a browser will still attempt to deal with it and a Regular Expression will absolutely NOT deal with it. People will always try and find tags using regex, it's just one of those things. Carry on! – Robin Day Aug 13 '10 at 06:52

3 Answers

2

The `.` does not, by default, match newlines, so your pattern only matches script elements that fit entirely on a single line.

Use RegexOptions.Singleline to fix this. It changes the meaning of . to match any character, including the newline, so you get multi-line matches too.

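Don’t be misled by the name: Singleline does not restrict matches to one line. Also don’t confuse it with RegexOptions.Multiline, which is unrelated: it only changes ^ and $ so that they match at the start and end of each line (read the IntelliSense tooltips to find out).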
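To make the difference concrete, here is a minimal sketch that runs the pattern from the question with and without RegexOptions.Singleline. The class name and the sample string are only illustrative (the sample is the markup from the question), and it sticks to C# 2.0 syntax.

```csharp
using System;
using System.Text.RegularExpressions;

class ScriptTagDemo
{
    static void Main()
    {
        // Sample input copied from the question: two one-line tags, one multi-line tag.
        string html =
            "<script type=\"text/javascript\" src=\"Scripts/Scipt1.js\"></script>\n" +
            "<script type=\"text/javascript\" src=\"Scripts/Scipt2.js\"></script>\n" +
            "\n" +
            "<script type=\"text/javascript\">\n" +
            "   function SomeMethod()\n" +
            "   {\n" +
            "\n" +
            "   }\n" +
            "</script>\n";

        string pattern = "<script.*?>.*?</script>";

        // Default options: '.' stops at '\n', so the multi-line tag is missed.
        MatchCollection withoutOption = Regex.Matches(html, pattern);
        Console.WriteLine("Default:    " + withoutOption.Count + " matches");

        // Singleline: '.' also matches '\n', so a match may span several lines.
        MatchCollection withOption =
            Regex.Matches(html, pattern, RegexOptions.Singleline);
        Console.WriteLine("Singleline: " + withOption.Count + " matches");

        // Each match value includes the outer <script> tags.
        foreach (Match m in withOption)
        {
            Console.WriteLine("----");
            Console.WriteLine(m.Value);
        }
    }
}
```

If the page might contain tags such as `<SCRIPT>`, combining the option with RegexOptions.IgnoreCase is probably worth considering as well.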

djdd87
Timwi
  • 2
    This actually works well, quickly and gives me exactly what I want... I don't like kittens anyway, so I don't really care that much if God kills one because I use Regex. – djdd87 Aug 12 '10 at 13:27
1

You should use the HTML Agility Pack.

For example:

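// Requires the HtmlAgilityPack library (using HtmlAgilityPack;).
var doc = new HtmlDocument();
doc.LoadHtml(source); // parses HTML held in a string

// Enumerates every <script> element in the document.
var scripts = doc.DocumentNode.Descendants("script");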
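Since the question asks to keep the outer script tags, a slightly fuller sketch may help. It assumes the HtmlAgilityPack assembly is referenced; the class name and the sample markup are made up for illustration, and it avoids `var` so it also compiles as C# 2.0.

```csharp
using System;
using HtmlAgilityPack; // third-party library, must be referenced separately

class AgilityPackDemo
{
    static void Main()
    {
        // Hypothetical input; note the unclosed <br>, which is not well-formed XML.
        string source =
            "<html><head>" +
            "<script type=\"text/javascript\" src=\"Scripts/Scipt1.js\"></script>" +
            "<script type=\"text/javascript\">\nfunction SomeMethod()\n{\n}\n</script>" +
            "</head><body><br>Some text</body></html>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(source);

        // OuterHtml includes the <script> start and end tags, not just the body.
        foreach (HtmlNode script in doc.DocumentNode.Descendants("script"))
        {
            Console.WriteLine(script.OuterHtml);
            Console.WriteLine("----");
        }
    }
}
```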
Timwi
SLaks
0

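Depending on the quality of your HTML, an XML parser may work. XML element names are case-sensitive, so the name has to match the casing used in the page:

var scripts = XDocument.Parse(HTMLSTRING).Descendants("script");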

Edit: Pre Xml.Linq version:

XmlDocument xDoc = new XmlDocument();
xDoc.LoadXml(HTMLSTRING); // LoadXml parses a string; Load expects a file name, URL or stream
XmlNodeList scripts = xDoc.SelectNodes("//script"); // XPath names are case-sensitive

Note: both of those are untested.
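As a rough illustration of the caveat about HTML quality, the sketch below parses one string that happens to be well-formed XML and one that is ordinary HTML with an unclosed `<br>`. The class name and both sample strings are invented for the example; only `XmlDocument.LoadXml`, `SelectNodes` and `XmlException` come from the framework.

```csharp
using System;
using System.Xml;

class XmlScriptDemo
{
    static void Main()
    {
        // Well-formed XML with a single root element: parsing succeeds.
        string xhtml =
            "<html><head>" +
            "<script type=\"text/javascript\" src=\"a.js\"></script>" +
            "<script type=\"text/javascript\">function f() { }</script>" +
            "</head></html>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);
        foreach (XmlNode node in doc.SelectNodes("//script"))
        {
            // OuterXml keeps the enclosing <script> element.
            Console.WriteLine(node.OuterXml);
        }

        // Ordinary HTML that is not well-formed XML (unclosed <br>): LoadXml throws.
        try
        {
            new XmlDocument().LoadXml("<html><body><br></body></html>");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not XML: " + ex.Message);
        }
    }
}
```

So this route only helps when the input is known to be XHTML-like; for arbitrary pages the parser gives up on the first construct it cannot read as XML.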

Robin Day
  • Unfortunately I'm using C# 2.0 on this project. Looks like it would have been a good solution though. – djdd87 Aug 12 '10 at 13:04
  • You can still use XmlDocument object. It's just more than one line of code. – Robin Day Aug 12 '10 at 13:05
  • Added, as I say though, untested, but you should get the idea. Biggest problem you will have though is if your HTML is valid XML or not. – Robin Day Aug 12 '10 at 13:08
  • Yeah, seems to have issues "There are multiple root elements.". There's a lot of 3rd party crap in this project. Namely Infragistics, so quality is a pretty far fetched idea. – djdd87 Aug 12 '10 at 13:12
  • Downvoted because the question is asking about *HTML*, not *XHTML*. `XDocument.Parse()` will completely fail and throw an exception for everything that isn’t XML, even when it’s valid HTML. – Timwi Aug 12 '10 at 13:25
  • @Timwi: I put a caveat at the top depending on the quality of the HTML. Also noted in comments that the HTML would have to be valid XML. It is an alternative answer showing one way of not using Regular Expressions. You may well be able to achieve this with Regex, however, it is a code smell, there will ALWAYS be a gotcha that will get you later on. – Robin Day Aug 12 '10 at 13:37
  • 1
    An XML parser will *completely fail on perfectly valid, high-quality HTML*. It won’t even output anything half-useful: it will just throw. – Timwi Aug 12 '10 at 13:55