Be careful what you ask for!
The first two of your three objectives can be done fairly well using regex, but doing so is not trivial and is not 100% reliable (see caveats below).
Picking strings out of JavaScript
First, let's look at how to pick out single and double quoted sub-strings from a longer string of pure Javascript code (not HTML). To do this correctly, a regex must match not only both types of quoted strings, but also both single and multi-line comments. This is because quotes may appear inside comments (e.g. /* I can't take it! */), and these quotes must be ignored. Also, comment delimiters may appear inside quoted strings (e.g. var str = "This: /* can cause trouble too!";), so all four constructs must be parsed out in one pass. Here is a regex which matches both types of comments and both types of quoted strings. It is presented in commented, verbose mode (using PHP single quoted syntax):
$re = '%# Parse comments and quoted strings from javascript code.
/\*[^*]*\*+(?:[^*/][^*]*\*+)*/ # A multi-line comment, or
| (\'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\') # $1: Single quoted string, or
| ("[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*") # $2: Double quoted string, or
| //.* # A single line comment.
%x';
This regex captures single quoted strings into group $1 and double quoted strings into group $2 (either type of string may contain escaped characters, e.g. 'That\'s cool!'). Both types of comments are captured by the overall match when neither the $1 nor the $2 capture group matches. Note also that this regex implements Jeffrey Friedl's "unrolling-the-loop" efficiency technique (see: Mastering Regular Expressions (3rd Edition)), so it is quite fast.
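To see why the comments have to be matched in the same pass as the strings, here is a small sketch (my own illustration, not part of the solution itself). It runs a naive, strings-only regex and the combined regex, written as a native Javascript literal, over the comment example from above. The helper name list_strings() and the variable names are made up for this demo.

```javascript
// Combined regex from above, as a native Javascript RegExp literal.
var combined = /\/\*[^*]*\*+(?:[^*\/][^*]*\*+)*\/|('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|\/\/.*/g;
// Naive regex: single quoted strings only, unaware of comments.
var naive = /'[^'\\]*(?:\\[\S\s][^'\\]*)*'/g;

// Hypothetical helper: collect only the quoted strings (groups 1 and 2)
// found by the combined regex; comments match but are skipped.
function list_strings(text) {
    var found = [];
    text.replace(combined, function(m0, m1, m2) {
        if (m1 || m2) found.push(m1 || m2);
        return m0;
    });
    return found;
}

var code = "/* I can't take it! */ var s = 'ok';";
var naive_result = code.match(naive);
var combined_result = list_strings(code);

console.log(naive_result);    // [ "'t take it! */ var s = '" ]
console.log(combined_result); // [ "'ok'" ]
```

The naive regex starts its "string" at the apostrophe inside the comment and swallows everything up to the next quote, while the combined regex consumes the comment first and then finds the real string.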
process_js()
The following Javascript function, process_js(), implements the above regex (in non-verbose, native Javascript RegExp literal syntax). It performs a global (repetitive) replace using an anonymous function which processes the single and double quoted strings independently and preserves all comments. Two additional functions, process_sq() and process_dq(), perform the processing on the matched single and double quoted strings respectively:
function process_js(text) {
    // Process single and double quoted strings outside comments.
    var re = /\/\*[^*]*\*+(?:[^*\/][^*]*\*+)*\/|('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|\/\/.*/g;
    return text.replace(re,
        function(m0, m1, m2){
            if (m1) return process_sq(m1); // 'single-quoted'.
            if (m2) return process_dq(m2); // "double-quoted".
            return m0; // preserve comments.
        });
}
function process_sq(text) {
    return text.replace(/\bsomestring\b/g, 'SOMESTRING_SQ');
}
function process_dq(text) {
    return text.replace(/\bsomestring\b/g, 'SOMESTRING_DQ');
}
Note that the two quoted-string handling functions merely replace the keyword somestring with SOMESTRING_SQ and SOMESTRING_DQ respectively, so that the results of the processing will be evident. These functions are designed to be modified by the user as needed. Let's see how this performs with a string of Javascript (similar to the example provided in the OP):
Test data input:
// comment with foo somestring bar in it
// comment with "foo somestring bar" in it
// comment with 'foo somestring bar' in it
/* comment with foo somestring bar in it */
/* comment with "foo somestring bar" in it */
/* comment with 'foo somestring bar' in it */
document.write(" with foo somestring bar in it ");
document.write(' with foo somestring bar in it ');
document.write(' with "foo somestring bar" in it ');
document.write(" with 'foo somestring bar' in it ");
var str = " with foo somestring bar in it ";
var str = ' with foo somestring bar in it ';
var str = ' with "foo somestring bar" in it ';
var str = " with 'foo somestring bar' in it ";
Test data output from process_js():
// comment with foo somestring bar in it
// comment with "foo somestring bar" in it
// comment with 'foo somestring bar' in it
/* comment with foo somestring bar in it */
/* comment with "foo somestring bar" in it */
/* comment with 'foo somestring bar' in it */
document.write(" with foo SOMESTRING_DQ bar in it ");
document.write(' with foo SOMESTRING_SQ bar in it ');
document.write(' with "foo SOMESTRING_SQ bar" in it ');
document.write(" with 'foo SOMESTRING_DQ bar' in it ");
var str = " with foo SOMESTRING_DQ bar in it ";
var str = ' with foo SOMESTRING_SQ bar in it ';
var str = ' with "foo SOMESTRING_SQ bar" in it ';
var str = " with 'foo SOMESTRING_DQ bar' in it ";
Notice that somestring has been processed only within valid Javascript strings and has been ignored within comments. For pure Javascript, this function works pretty darn well!
Picking Javascript out of HTML
Parsing Javascript from HTML using regex is not recommended (see caveats below). However, a reasonably good job can be done if you are comfortable using a complex regex and can accept its limitations (once again, see caveats below). That said, here are the requirements for our regex solution: In HTML, Javascript can occur inside <SCRIPT> elements and within onclick event attributes (and the other HTML 4.01 mouse and keyboard events: ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown and onkeyup; other events such as onload or onsubmit are not covered, but could be added to the alternation in the same way). Javascript can also occur within javascript: pseudo-URLs, but IMHO that is really bad practice, so this solution does not attempt to match these. HTML is a complex language, and our scraping regex needs to ignore comments and CDATA sections. A multi-global-alternative regex (similar to the previous one) matches each of these structures. Here is the "pluck-js-from-html" regex in commented, verbose mode (once again presented in PHP single quoted syntax):
$re = '%# (Unreliably) parse out javascript from HTML.
# Either... Option 1: SCRIPT element.
(<script\b[^>]*>) # $1: SCRIPT open tag.
([\S\s]*?) # $2: SCRIPT contents.
(<\/script\s*>) # $3: SCRIPT close tag.
# Or... Options 2 and 3: onclick=quoted-js-code.
| ( # $4: onXXXX =
\bon # HTML 4.01 mouse and key events...
(?:click|dblclick|mousedown|mouseup|mouseover|
mousemove|mouseout|keypress|keydown|keyup
)\s*=\s* # with = and optional ws.
) # End $4:
(?: # value alternatives are either
"([^"]*)" # $5: Double-quoted-js,
| \'([^\']*)\' # or $6: Single-quoted-js.
) # End group of alternatives.
# Or other HTML stuff that we should not mess with.
| <!--[\S\s]*?--> # HTML (non-SGML) comment.
| <!\[CDATA\[[\S\s]*?\]\]> # or CDATA section.
%ix';
In this regex, we capture SCRIPT element javascript in groups $1 (open tag), $2 (contents) and $3 (closing tag), and onXXX event handler code in groups $4 (event attribute name plus the = sign), $5 (double-quoted value contents) and $6 (single-quoted value contents). Comments and CDATA sections are captured by the overall match (when none of the capture groups match). Note that this regex does not make use of the "unrolling-the-loop" technique (although it certainly could), because that would add too much complexity for most readers. (All three of the lazy-dot-star expressions, i.e. [\S\s]*?, can be unrolled to speed this up.)
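For the curious, here is one way the HTML comment alternative could be unrolled. Treat it as a sketch under my own assumptions rather than a tuned replacement: the variable names are made up, and I have only checked it against the handful of sample strings shown, where it returns the same matches as the lazy version.

```javascript
// Lazy version, as used in the regex above.
var lazy_comment = /<!--[\S\s]*?-->/g;
// Unrolled version, following the pattern: normal* (special normal*)*
//   normal  = [^-]      any character that is not a dash
//   special = -(?!->)   a dash that does not begin the closing "-->"
var unrolled_comment = /<!--[^-]*(?:-(?!->)[^-]*)*-->/g;

var samples = [
    'a <!-- one --> b <!-- two - with -- dashes --> c',
    '<!---->',
    '<!-- x --->',
    '<!-- unterminated'
];

// Both regexes should report identical matches for every sample.
var all_equal = samples.every(function(s) {
    return JSON.stringify(s.match(lazy_comment)) ===
           JSON.stringify(s.match(unrolled_comment));
});
console.log(all_equal); // true
```

The same normal/special split should carry over to the SCRIPT contents and the CDATA section, with the closing delimiter changed accordingly.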
process_html()
The following Javascript function, process_html(), implements the above regex (in non-verbose, native Javascript RegExp literal syntax). It performs a global (repetitive) replace using an anonymous function which handles the three different sources of javascript code. It then calls the previously described process_js() function to process the captured js code. Here it is:
function process_html(text) {
    // Pick out javascript from HTML event attributes and SCRIPT elements.
    // The i flag mirrors the %ix modifiers of the verbose version above,
    // so that <SCRIPT> and onClick are matched as well.
    var re = /(<script\b[^>]*>)([\S\s]*?)(<\/script\s*>)|(\bon(?:click|dblclick|mousedown|mouseup|mouseover|mousemove|mouseout|keypress|keydown|keyup)\s*=\s*)(?:"([^"]*)"|'([^']*)')|<!--[\S\s]*?-->|<!\[CDATA\[[\S\s]*?\]\]>/gi;
    return text.replace(re,
        function(m0, m1, m2, m3, m4, m5, m6) {
            if (m1) { // Case 1: <script> element.
                m2 = process_js(m2);
                return m1 + m2 + m3;
            }
            if (m4) { // Case 2: onXXX event attribute.
                if (m5) { // Case 2a: double quoted.
                    m5 = process_js(m5);
                    return m4 + '"' + m5 + '"';
                }
                if (m6) { // Case 2b: single quoted.
                    m6 = process_js(m6);
                    return m4 + "'" + m6 + "'";
                }
            }
            return m0; // Else return other non-js matches unchanged.
        });
}
Test data input:
<script>
/* comment with 'foo somestring bar' in it */
document.write(" with 'foo somestring bar' in it ");
var str = " with 'foo somestring bar' in it ";
</script>
<!-- with foo somestring bar in it -->
<!-- with "foo somestring bar" in it -->
<!-- with 'foo somestring bar' in it -->
<![CDATA[ with foo somestring bar in it ]]>
<![CDATA[ with "foo somestring bar" in it ]]>
<![CDATA[ with 'foo somestring bar' in it ]]>
<p>non-js with foo somestring bar in it non-js</p>
<p>non-js with "foo somestring bar" in it non-js</p>
<p>non-js with 'foo somestring bar' in it non-js</p>
<p onclick="with foo somestring bar in it">stuff</p>
<p onclick="with 'foo somestring bar' in it">stuff</p>
<p onclick='with foo somestring bar in it'>stuff</p>
<p onclick='with "foo somestring bar" in it'>stuff</p>
Test data output from process_html():
<script>
/* comment with 'foo somestring bar' in it */
document.write(" with 'foo SOMESTRING_DQ bar' in it ");
var str = " with 'foo SOMESTRING_DQ bar' in it ";
</script>
<!-- with foo somestring bar in it -->
<!-- with "foo somestring bar" in it -->
<!-- with 'foo somestring bar' in it -->
<![CDATA[ with foo somestring bar in it ]]>
<![CDATA[ with "foo somestring bar" in it ]]>
<![CDATA[ with 'foo somestring bar' in it ]]>
<p>non-js with foo somestring bar in it non-js</p>
<p>non-js with "foo somestring bar" in it non-js</p>
<p>non-js with 'foo somestring bar' in it non-js</p>
<p onclick="with foo somestring bar in it">stuff</p>
<p onclick="with 'foo SOMESTRING_SQ bar' in it">stuff</p>
<p onclick='with foo somestring bar in it'>stuff</p>
<p onclick='with "foo SOMESTRING_DQ bar" in it'>stuff</p>
As you can see, this works pretty darn well and correctly modifies only quoted strings within javascript within HTML.
Caveats: To correctly and reliably extract Javascript from HTML (i.e. parse it), you must use a parser. Although the above algorithm does a pretty decent job, there are certainly cases where it will fail. For example, the following non-javascript code will be matched:
<p title="Title with onclick='fake code erroneously matched here!'">stuff</p>
<p title='onclick="alert('> and somestring here too </p><p title=');"'>stuff</p>
<p title='<script>alert("Bad medicine!");</script>'>stuff</p>
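To make the failure concrete, here is a small sketch (again just an illustration with made-up variable names) that applies the event-attribute and SCRIPT alternatives of the regex on their own to the first and third examples. Both report a "match" even though the text sits inside a title attribute value.

```javascript
// Event-attribute alternative, copied from the regex above.
var re_event = /(\bon(?:click|dblclick|mousedown|mouseup|mouseover|mousemove|mouseout|keypress|keydown|keyup)\s*=\s*)(?:"([^"]*)"|'([^']*)')/g;
// SCRIPT element alternative, copied from the regex above.
var re_script = /(<script\b[^>]*>)([\S\s]*?)(<\/script\s*>)/g;

var html1 = '<p title="Title with onclick=\'fake code erroneously matched here!\'">stuff</p>';
var html3 = '<p title=\'<script>alert("Bad medicine!");</script>\'>stuff</p>';

var hits1 = html1.match(re_event);
var hits3 = html3.match(re_script);

console.log(hits1); // [ "onclick='fake code erroneously matched here!'" ]
console.log(hits3); // [ '<script>alert("Bad medicine!");</script>' ]
```

The regex has no notion of "I am currently inside an attribute value", which is exactly the kind of context a real HTML parser keeps track of.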
Phew!