I have a shell snippet that finds all external JavaScript scripts in thousands of random HTML pages that use the <script src="…" paradigm to include said scripts with absolute URLs:
find ./ -type f -print0 | xargs -0 \
perl -nle 'print $1 while (m%<script[^>]+((https?:)?//[-./0-9A-Z\_a-z]+)%ig);'
Since scripts could also be loaded dynamically within JavaScript itself, I'd like to expand my snippet to match any absolute URL-like string that ends in .js and preferably appears within the script tags. (This won't be 100% accurate, but would probably be good enough to find a few extra cases of external scripts.)
I'm thinking of something like
<script[^>]*>.*["']((((https?)?:)?//)?[-.0-9A-Za-z]+\.[A-Za-z]{2,}/[-./0-9A-Z\_a-z]+\.js)
and maybe also with .*</script> at the end.
A tricky part is ensuring that multiple mentions of .js within a single script element result in multiple matches (which the regex above won't do by itself), while also ensuring that the two expressions I have never both match the same mention of a URL, which would print the same $1 string twice.
What would be a good way to add this new regex to the perl snippet I have?