I have a string, and I want to extract the parts of it that match a pattern. For example:
stringToFind:
"<html><head><script src="http://example.com"></script>..."
I need to find every script tag in the string and get the URL from its `src` attribute.

I was going to use `String.prototype.replace()`, which accepts regular expressions, but you have to replace the match with something, and the result is the whole string.

  • If your input string is well-formed HTML, I recommend you use an HTML parser rather than regex. – CAustin May 27 '23 at 03:31
  • Obligatory link: [You can't parse \[X\]HTML with regex.](https://stackoverflow.com/a/1732454) – InSync May 27 '23 at 06:36
  • @CAustin exactly! JavaScript actually has a built-in XML/HTML parser called `DOMParser` (see my answer). Even if the HTML is badly formed, there is probably a library for fixing broken HTML that one could use before parsing. – Levi May 27 '23 at 07:03

2 Answers

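// The lookbehind anchors on `<script ... src="` and the lookahead on the rest of the tag
// up to `</script>`, so each match is only the URL between the quotes.
const sources = document.documentElement.innerHTML.match(/(?<=<script.+src=")[^"]+(?=".*>\s*<\/script\s*>)/g);
console.log(sources);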
<script 
  src="https://cdnjs.cloudflare.com/ajax/libs/knockout/3.4.2/knockout-min.js"
  type="text/javascript"
></script>
<script defer src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
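
The same extraction can also be written with a capture group and `String.prototype.matchAll`, which speaks to the question's concern that `replace()` hands back the whole string. This is only a minimal sketch, assuming the input is a plain string with double-quoted attributes. The `extractScriptSources` name and the sample markup are made up for illustration, and a regex like this will still trip over unusual markup (single-quoted or unquoted attributes, `src=` appearing inside script text).

```javascript
// Hypothetical helper (not part of the original answer): collect every script src
// from an HTML string using a capture group instead of lookarounds.
function extractScriptSources(html) {
  // `<script`, then any attributes (`[^>]` keeps us inside the opening tag and
  // also matches newlines), then whitespace and src="...".
  // Group 1 captures the URL between the double quotes.
  const pattern = /<script\b[^>]*?\ssrc="([^"]*)"/gi;
  // matchAll yields one match array per hit; index 1 is the captured group.
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

// Made-up sample input: one plain tag, one multi-line tag, one inline script.
const sample = [
  '<html><head><script src="http://example.com"></script>',
  '<script defer\n  src="/assets/app.js" type="text/javascript"></script>',
  '<script>var code = "nope";</script></head></html>',
].join('\n');

console.log(extractScriptSources(sample));
```

Unlike `match()` with the `g` flag, `matchAll` keeps the capture groups for every match, so no lookbehind support is needed.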
Alexander Nenashev

JavaScript in the browser ships with a built-in HTML parser (`DOMParser`), so you probably don't need a regex here.

This should do the trick:

//<!-- 
var html_code = `<html>
<head>
<script src="https://code.jquery.com/jquery-3.7.0.slim.min.js"></script>
<script src="/assets/script.js"></script>
</head>
<body>
  <script>
    var code = 'nope';
  </script>
  <p>Other stuff</p>
  <footer><script src="/assets/footer.js"></script></footer>
  </body>
</html>`;
// -->

const parser = new DOMParser();
const html_doc = parser.parseFromString(html_code, 'text/html');

const script_tags = html_doc.querySelectorAll('script[src]');
const sources = Array.from(script_tags).map((s) => s.getAttribute('src'));

console.log(sources);

If you need to extract the script sources from the live DOM in the browser, you only need this:

console.log(Array.from(document.querySelectorAll('script[src]')).map((s) => s.getAttribute('src')));
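
Note that `getAttribute('src')` returns the attribute text exactly as written, so the relative entries in the example come back as `/assets/script.js` rather than as full URLs. If absolute URLs are needed, the standard `URL` constructor can resolve them against a base. A minimal sketch follows; the `toAbsolute` helper and the `https://example.com/page.html` base are assumptions for illustration only.

```javascript
// Hypothetical helper: resolve raw src attribute values against a base URL.
function toAbsolute(sources, baseUrl) {
  // new URL(input, base) applies standard URL resolution;
  // inputs that are already absolute ignore the base.
  return sources.map((src) => new URL(src, baseUrl).href);
}

// The same src values as in the example markup above.
const rawSources = [
  'https://code.jquery.com/jquery-3.7.0.slim.min.js',
  '/assets/script.js',
  '/assets/footer.js',
];

// The base URL is an assumed page address, not something from the original answer.
console.log(toAbsolute(rawSources, 'https://example.com/page.html'));
```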


Side note: the `// <!--` and `// -->` lines are there to make the JSFiddle run (it won't run reliably when a string contains HTML code), as suggested by @InSync.

Levi
  • A trick you may utilize: Just enclose `html_code` in `// <!-- -->` (or even `<!-- -->` for that matter). – InSync May 27 '23 at 07:10
  • Wow, I never knew you could use HTML comments around JavaScript code to hold HTML strings. I was having a problem like that before. – Cole Brennan Jun 08 '23 at 01:47