0

I'm trying to write a regex that searches a page for script tags and extracts their content. To accommodate any HTML-writing style, I want the regex to match script tags with an arbitrary amount of whitespace (e.g. <script type = blahblah> and <script type=blahblah> should both be found). My first attempt gave funky results, so I broke the problem down into something simpler and decided to just test and play around with a regex like /\s*h\s*/g.

When testing it out on strings, for some reason seemingly arbitrary amounts of whitespace around the 'h' would match and other amounts wouldn't, e.g. something like " h " would match but " h " wouldn't. Does anyone have an idea of why this is occurring or what error I'm making?

Niet the Dark Absol
  • 6
    [The pony he comes...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Niet the Dark Absol Jul 06 '12 at 19:10
  • 1
    Can you give some *specific* examples. Show the *exact* code you tried and the *exact* input strings you used to test it. – Mark Byers Jul 06 '12 at 19:10
  • ...but the question now is 'something simpler' like /\s*h\s*/g. I'm not sure the question is specifically about matching HTML any more - it's about an observed/perceived oddity. – Lee Kowalkowski Jul 06 '12 at 19:19
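
The question doesn't show the test code, so the following is only one possible explanation, not a diagnosis. A regex object with the `/g` flag keeps its `lastIndex` between calls to `test()`, so reusing one object across several strings can make results appear to depend on the amount of whitespace. A minimal sketch, with hypothetical test strings:

```javascript
// Hypothetical inputs; the question's exact strings are not shown.
const inputs = [' h ', 'h', '   h   '];

// A fresh, non-global regex per call: every input matches.
const freshResults = inputs.map((s) => /\s*h\s*/.test(s));

// One shared /g regex: test() starts searching at lastIndex,
// which is left over from the previous successful match.
const shared = /\s*h\s*/g;
const sharedResults = inputs.map((s) => shared.test(s));

console.log(freshResults);  // [ true, true, true ]
console.log(sharedResults); // [ true, false, true ]
```

If this is what happened, dropping the `/g` flag for `test()` calls, or resetting `shared.lastIndex = 0` before each call, makes the results consistent.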

2 Answers

2

Since you're using JavaScript, why can't you just use getElementsByTagName('script')? That's how you should be doing it.

If you somehow have an HTML string, create an iframe and dump the HTML into it, then run getElementsByTagName('script') on it.
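
A rough sketch of the iframe route, assuming a browser environment. The helper name `scriptsViaIframe` is made up for illustration. Note that writing HTML into a document this way lets any scripts in the string run, and external resources may still load afterwards.

```javascript
// Hypothetical helper; browser-only (needs a real DOM).
function scriptsViaIframe(html) {
  // An iframe with no src gets an empty about:blank document.
  const iframe = document.createElement('iframe');
  iframe.style.display = 'none';
  document.body.appendChild(iframe);

  // Write the HTML string into the iframe's own document.
  const doc = iframe.contentDocument;
  doc.open();
  doc.write(html);
  doc.close();

  // Live collection of the <script> elements parsed from the string.
  return doc.getElementsByTagName('script');
}
```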

Niet the Dark Absol
  • It is an HTML string, and dumping it into an iframe and loading the HTML would involve me putting all the work that comes after this extraction into an event listener for when the HTML actually loads, which seems like more trouble than it's worth considering the complexity of that work and that I already have almost all of the work for the regular expression method done. – user1507608 Jul 06 '12 at 19:32
  • @user1507608: so what's your specific regex question? Because `/\s*h\s*/g` matches `'h'`, `' h '`, and `' h '` (with more space), although the global switch `/g` does nothing with those test strings. So if you want help, you need to elaborate a little more. Otherwise your question is unanswerable. – Lee Kowalkowski Jul 06 '12 at 19:40
  • You don't need any load handler, just create the iframe and use `document.write()` to shove the HTML string in there. – Niet the Dark Absol Jul 06 '12 at 19:43
  • Ok, I had gotten the impression that the loading was asynchronous, because when I tried to access the DOM immediately after using document.write() the entire page had not loaded. But maybe that was because the page I was loading had parts it itself loaded asynchronously. – user1507608 Jul 06 '12 at 20:19
  • What are you loading into the iframe? You should just load `about:blank` (or nothing at all) – Niet the Dark Absol Jul 06 '12 at 20:45
0

OK, to extend Kolink's answer, you don't need an iframe, or event handlers:

var temp = document.createElement('div');
temp.innerHTML = otherHtml;
var scripts = temp.getElementsByTagName('script');

... now scripts is a DOM collection of the script elements - and the script doesn't get executed ...
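
As a sketch of what you can do with that collection, assuming a browser environment: the helper name `listInertScripts` and the shape of the returned objects are made up for illustration.

```javascript
// Hypothetical helper; browser-only (needs a real DOM).
function listInertScripts(html) {
  // Scripts inserted through innerHTML are parsed but not executed.
  const temp = document.createElement('div');
  temp.innerHTML = html;

  // Copy what we need out of the live collection into plain objects.
  return Array.from(temp.getElementsByTagName('script'), (script) => ({
    type: script.getAttribute('type'), // null when the attribute is absent
    src: script.getAttribute('src'),   // null for inline scripts
    code: script.text,                 // the inline script content
  }));
}
```

Filtering by type then becomes a plain comparison, e.g. `listInertScripts(otherHtml).filter((s) => s.type === 'blahblah')`, with no whitespace or quoting cases to handle.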


Why regex is not a fantastic idea for this:

As a <script> element may not contain the string </script> anywhere, writing a regex to match them would not be difficult: /<script[\s\S]+?<\/script>/gi ([\s\S] matches any character, newlines included).
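
A small sketch of that basic match, using a made-up HTML string. One detail worth checking when trying it: inside a character class `.` is a literal dot, so a class like `[.\n]` only matches dots and newlines. The sketch uses `[\s\S]` for "any character, including newlines".

```javascript
// Hypothetical sample input.
const html = [
  '<p>intro</p>',
  '<script type = "blahblah">',
  'var a = 1.5;',
  '</script>',
  '<SCRIPT>var b = 2;</SCRIPT>',
].join('\n');

// Lazy match from an opening <script up to the nearest </script>.
const scriptBlock = /<script[\s\S]+?<\/script>/gi;
const blocks = html.match(scriptBlock) || [];

// For comparison: a class of "literal dot or newline" finds nothing here.
const dotClass = /<script[.\n]+?<\/script>/gi;
const dotClassBlocks = html.match(dotClass) || [];

console.log(blocks.length);         // 2
console.log(dotClassBlocks.length); // 0
```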

It looks like you want to only match scripts with a specific type attribute. You could try to include that in your pattern too: /<script[^>]+type\s*=\s*(["']?)blahblah\1[\s\S]*?<\/script>/gi - but that is horrible. (That's what happens when you use regular expressions on irregular strings; you need to simplify.)

So instead you iterate through all the basic matched scripts, extract the starting tag with startTag = result.match(/<script[^>]*>/i)[0], and within that search for your type attribute: /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/.test(startTag). Oh look - it's back to horrible - simplify!
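
Here is a sketch of that two-step check with made-up inputs. The helper name `hasBlahType` is hypothetical. The last case shows one way the pattern goes wrong: it also accepts an attribute that merely ends in "type".

```javascript
// Pattern from the text: quoted or bare blahblah after "type =".
const typePattern = /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/;

// Hypothetical helper: true when the block's start tag carries the type.
function hasBlahType(block) {
  const startTagMatch = block.match(/<script[^>]*>/i);
  return startTagMatch !== null && typePattern.test(startTagMatch[0]);
}

const spaced = hasBlahType('<script type = blahblah>var a = 1;</script>');
const quoted = hasBlahType('<script type="blahblah">var a = 1;</script>');
const other = hasBlahType('<script type="text/javascript">var a = 1;</script>');
// "data-type" contains "type", so this is wrongly accepted.
const falsePositive = hasBlahType('<script data-type=blahblah>var a = 1;</script>');

console.log(spaced, quoted, other, falsePositive); // true true false true
```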

This time via normalisation: startTag = startTag.replace(/\s*=\s*/g, '=').replace(/=([^\s"'>]+)/g, '="$1"') - now you're in danger territory: what if the = is inside a quoted string? Can you see how it just gets more and more complicated?
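
To make the danger concrete, here is the normalisation wrapped in a hypothetical helper (`normaliseStartTag`) and run on three made-up start tags. The third has an = inside a quoted value and comes out corrupted.

```javascript
// Hypothetical helper applying the two replacements from the text.
function normaliseStartTag(startTag) {
  return startTag
    .replace(/\s*=\s*/g, '=')             // drop whitespace around every =
    .replace(/=([^\s"'>]+)/g, '="$1"');   // quote unquoted attribute values
}

const bare = normaliseStartTag('<script type = blahblah>');
const alreadyQuoted = normaliseStartTag('<script type = "blahblah">');
// The = inside the quoted value is rewritten too, breaking the tag.
const broken = normaliseStartTag('<script data-x="a = b" type=blahblah>');

console.log(bare);          // <script type="blahblah">
console.log(alreadyQuoted); // <script type="blahblah">
console.log(broken);        // <script data-x="a="b"" type="blahblah">
```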

You can only have this work using regex if you make robust assumptions about the HTML you'll use it on (i.e. to make it regular). Otherwise your problems will grow and grow and grow!

  • disclaimer: I haven't tested any of the regex used to see if they do what I say they do, they're just example attempts.
Lee Kowalkowski