-1

I have the regex below, which iterates through a string so that I can build an array of script URLs. It breaks if the script tag also has a type or id specified. Is there a way to ignore certain attributes on the script tags, such as id, class, type, etc.?

var regSrc = /<script.*?src="(.*?)"><\/script>/gmi;
user1572796

4 Answers

3

Don't use regex to parse HTML. Use the DOM instead. It's much less painful:

function get_script_src_from_string (INPUT_STRING) {

  var tempDiv = document.createElement('div');
  tempDiv.innerHTML = INPUT_STRING;

  var scripts = tempDiv.getElementsByTagName('script');
  var script_urls = [];
  for (var i=0; i<scripts.length; i++) {
    script_urls.push(scripts[i].src);
  }
  return script_urls;

}

This works in all browsers, is easier to understand, and avoids the edge cases that trip up a regex.

The scripts are neither downloaded nor executed: the temporary div is never appended to the document, and script elements created through innerHTML do not run in any case.
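
One detail worth checking in practice: the `src` property returns the URL resolved against the current page, and it is an empty string for inline scripts. If the raw attribute values are wanted instead, a variant along these lines might help. This is only a sketch; the helper name `getRawScriptSrcs` and the sample markup are made up for illustration and are not part of the answer above.

```javascript
// Hypothetical helper (not from the original answer): collects the src
// attribute exactly as written and skips script tags that have no src.
function getRawScriptSrcs(container) {
  var scripts = container.getElementsByTagName('script');
  var urls = [];
  for (var i = 0; i < scripts.length; i++) {
    // getAttribute returns null when the attribute is missing
    var raw = scripts[i].getAttribute('src');
    if (raw) {
      urls.push(raw);
    }
  }
  return urls;
}

// Browser-only usage; the markup is an assumed example input.
if (typeof document !== 'undefined') {
  var tempDiv = document.createElement('div');
  tempDiv.innerHTML =
    '<script src="js/app.js" type="text/javascript"><\/script>' +
    '<script>var x = 1;<\/script>';
  console.log(getRawScriptSrcs(tempDiv)); // only "js/app.js" is collected
}
```

The helper takes the container as a parameter so it can be pointed at the temporary div or at any other element.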

slebetman
  • The elements are not in the DOM in this case; they are coming in as a string. Yes, this is terrible, but it is the project I am working on. – user1572796 Sep 26 '14 at 22:20
  • Look at my code. INPUT_STRING is a string, not something in the DOM. – slebetman Sep 26 '14 at 22:21
  • Your web browser includes a very robust HTML parser and it's called `innerHTML`. Use it. – slebetman Sep 26 '14 at 22:22
  • Or is the code running in node.js and you don't have access to the DOM? If so, there are also solutions that provide a virtual DOM in node. – slebetman Sep 26 '14 at 22:27
  • You're right, I really need to manipulate the DOM rather than the string now that this is getting more complicated. – user1572796 Sep 29 '14 at 22:15
  • For some reason IE8 ignores script tags but recognizes every other tag in a string. If you think of a fix for that, let me know. Thanks again – user1572796 Sep 29 '14 at 23:50
  • The fix for IE8 is to prepend a character to the innerHTML string and then remove that character: tempDiv.innerHTML = 'X' + content; ... – user1572796 Oct 01 '14 at 18:33
  • Hmm, that's strange indeed. The only reason I can see why you'd need that is that your content string is not well-formed HTML and adding text in front of it forces IE to parse it slightly differently. – slebetman Oct 02 '14 at 02:56
  • Yeah, very strange. I went through the jQuery library and it looks like that was their fix for this issue. – user1572796 Oct 02 '14 at 17:24
0

Try this regex:

/<script.*src="([^"]*).*><\/script>/

It will match any script tag that has a src and ignore every attribute other than src.
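
One caveat that is easy to check: both `.*` parts are greedy, so when several script tags sit on the same line the pattern swallows them as a single match and captures only the last src. The snippet below is a small demonstration; the input string is an assumed example, not taken from the question.

```javascript
// The pattern from this answer, unchanged.
var greedy = /<script.*src="([^"]*).*><\/script>/;

// Assumed sample input: two script tags on one line.
var html =
  '<script src="first.js"><\/script>' +
  '<script id="x" src="second.js"><\/script>';

var m = greedy.exec(html);
var captured = m ? m[1] : null;
console.log(captured); // second.js (first.js is never captured)
```

When every tag is on its own line this does not show up, because `.` does not match newlines.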

Johan Karlsson
0
/<script.*?src="([^"]*)"[^>]*><\/script>/gmi
Unihedron
  • Almost there: don't use ? next to *. Also, with this regex, `` : only the second.js will be captured – laruiss Sep 26 '14 at 22:32
0

Just for the sake of the principle (and for fun), here is my regex:

var regSrc = /<script(?: [a-z]+="[^"]*"| [a-z]+='[^']*')* src="([^"]*)"[^>]*><\/script>/gmi;

But @slebetman's answer is the right one and should be accepted. (This regex will not capture the src if it is written with single quotes, as in src='path/to/whatever.js', but it seems safer than the ones already given.)
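
If a regex really is the only option, the single-quote limitation can be relaxed with an alternation. The following is a rough sketch rather than a robust parser: the helper name `extractScriptSrcs` is made up, and the pattern assumes that no attribute value contains a `>` character and that src is always quoted.

```javascript
// Hypothetical helper (not from the thread): collects src values written
// with either double or single quotes, whatever the attribute order.
function extractScriptSrcs(html) {
  // \ssrc requires whitespace before "src", so data-src is not matched.
  var re = /<script\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)')[^>]*>/gi;
  var urls = [];
  var match;
  // With the g flag, exec resumes after the previous match on each call.
  while ((match = re.exec(html)) !== null) {
    // Group 1 holds a double-quoted value, group 2 a single-quoted one.
    urls.push(match[1] !== undefined ? match[1] : match[2]);
  }
  return urls;
}

// Assumed sample input covering both quote styles and an inline script.
var sample =
  '<script type="text/javascript" src="a.js" id="x"><\/script>\n' +
  "<script src='b.js'><\/script>" +
  '<script>inline();<\/script>';
console.log(extractScriptSrcs(sample)); // [ 'a.js', 'b.js' ]
```

Unquoted src values and script tags inside HTML comments are still not handled, which is part of why the DOM approach remains the safer choice.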

laruiss