
I have a shell snippet that finds all external JavaScript scripts in thousands of random html pages, which use the <script src="…" paradigm to include said scripts, with absolute URLs:

    find ./ -type f -print0 | xargs -0 \
        perl -nle 'print $1
            while (m%<script[^>]+((https?:)?//[-./0-9A-Z\_a-z]+)%ig);'

Since scripts could also be loaded dynamically within JavaScript itself, I'd like to expand my snippet to match any absolute URL-like string which ends in .js, and preferably appears within the script tags. (This won't be 100% accurate, but would probably be good enough to find a few extra cases of external scripts.)

I'm thinking of something like <script[^>]*>.*["']((((https?)?:)?//)?[-.0-9A-Za-z]+\.[A-Za-z]{2,}/[-./0-9A-Z\_a-z]+\.js), and maybe also with .*</script> at the end.

A tricky part comes in ensuring that multiple mentions of .js within a script result in multiple matches (which the regex above won't do by itself), but also that the two expressions that I have don't match in such a way as to produce two outputs from a single mention of a given $1 matching string in the input.

What would be a good way to add this new regex to the perl snippet I have?

Toto
cnst
  • You're not going to find `` *in* the external js files themselves though? You'll find them all linked one after the other in the html page. The only thing I could see is parsing for jQuery's [.getScript](http://api.jquery.com/jQuery.getScript/), which would be a lot simpler in regex. – brandonscript Dec 05 '13 at 02:29
  • @r3mus, thanks for bringing `.getScript` to my attention, but if you already have jQuery, you're not very likely to be loading external scripts with it; the idea is to find stuff like embedded third-party script references which the author of the page simply copy-pasted to enable some sort of tracking or something. – cnst Dec 05 '13 at 02:40
  • Even so, you still won't find it between tags. You might have better luck just looking for .js and back referencing until you find an invalid filename/path char? – brandonscript Dec 05 '13 at 02:43
  • @r3mus, I don't understand what you mean. You realise this is all html files, and we're looking at javascript right within the html files? – cnst Dec 05 '13 at 02:49
  • The way you worded the question it looks like you're searching through the loaded .js files to search for *more* dynamically loaded .js files. Am I mistaken? – brandonscript Dec 05 '13 at 02:52
  • I would avoid using regexes here, and make a small program to parse the HTML and find the – Matthew Lock Dec 05 '13 at 03:08
  • @MatthewLock definitely a more reliable and comprehensive way to go - Mojo is excellent. That said, if cnst has already built the script, changing it might be an unnecessary undertaking. – brandonscript Dec 05 '13 at 03:13

1 Answer


A tricky part comes in ensuring that multiple mentions of .js within a script result in multiple matches (which the regex above won't do by itself)…

This can be accomplished by splitting your contemplated regular expression in two, one part for the <script> tag and the other for the .js matches, and invoking the two parts in nested loops. The nesting is made possible by the c modifier, which prevents the current match position in the line from being reset when a match fails, and by the \G anchor, which matches at the point where the previous g match left off.

… also that the two expressions that I have don't match in a way as to result in two outputs from a single mention of a given $1 matching string in the input.

This is ensured by the first expression only matching within a <script …> tag and the second expression only matching between <script> and </script> tags.

So, the perl part of your shell snippet could look like:

    perl -nle '
    print $1 while m%<script[^>]+((https?:)?//[-./0-9A-Z\_a-z]+)%ig;
    while (m%<script[^>]*>%ig)  # for each <script> tag
    {
     print $2 while m%          # allow multiple mentions of `.js`
     \G((?!</script>).)*?       # do not pass over </script>, be non-greedy
     ["'"'](((https?:)?//)?[-.0-9A-Za-z]+\.[A-Za-z]{2,}/[-./0-9A-Z\_a-z]+\.js)
                     %ixgc      # c: keep the Current position for outer loop
    }"
Armali