
I am searching for text inside of website resources (html and javascript), and need to identify 3 regular expressions that will locate this text under certain circumstances:

  1. some string of text when it is contained inside of a javascript single-quoted string
  2. some string of text when it is contained inside of a javascript double-quoted string
  3. some string of text when it is not contained inside of a javascript string

Here are some scenarios that are likely to occur (searching for the string "somestring"):

document.write("here is a bunch of text and somestring is inside of it");
var thing = 'here is a bunch of text and somestring is inside of it';
document.write("some text and 'quote' and then somestring here");
document.write('some text and "quote" and then somestring here');
var thing = "some text and '" + quotedVar + "' and then somestring here";
document.write('some text and "' + quotedVar + '" and then ' + " more " + "somestring here");
this string is outside javascript and here is a 'quoted' line and somestring is after it
this string is outside javascript and here is a "quoted" line and somestring is after it

These examples might all appear inside the same file, and so the regular expressions should not assume single-case scenarios.

I have tried the following for finding single-quoted and double-quoted strings, but alas I have failed miserably:

single quotes:

([=|(|\+]\s*?'[^']*?(?:'[^']*?'[^']*?)*?somestring)

double quotes:

([=|(|\+]\s*?"[^"]*?(?:"[^"]*?"[^"]*?)*?somestring)

These work when assuming the right conditions, but there are many real-world scenarios that I have tried (read, real javascript files) where they fail. Any help is greatly appreciated!

Edit: For clarification, I am looking for 3 regular expressions, one for each of the conditions listed above, not a single expression that covers all cases.

SventoryMang
  • You're not going to be able to do this with regular expressions if you really have to handle general-case source files. – Pointy Apr 08 '11 at 16:08
  • Can you expand further pointy as to why not? What do you recommend then? – SventoryMang Apr 08 '11 at 16:09
  • Looks like you're trying to locate somestring no matter where it appears - within javascript or outside it. Why don't you just look for somestring? Or do you actually need to be able to tell which of the three scenarios it is? – Elad Apr 08 '11 at 16:14
  • It's because the source text that you're attempting to parse is constructed with grammars that cannot be parsed by regular expressions. It's just one of those things; a regular expression can only deal with a certain class of grammar. – Pointy Apr 08 '11 at 16:14
  • Elad, it's mandatory that I know which of the 3 scenarios the text was found in: I need to know if the text was contained in single quotes, double quotes, or neither. – SventoryMang Apr 08 '11 at 16:29
  • Pointy is right. This touches on the same principles involved in the famous "parsing html with regex debate"... http://danielmiessler.com/blog/stack-overflows-answer-to-can-you-parse-html-with-regex – KeatsKelleher Apr 08 '11 at 16:29
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (for reference to "the debate") –  Apr 08 '11 at 17:23

3 Answers


Be careful what you ask for!

The first two of your three objectives can be done fairly well using regex, but it is not trivial and is not 100% reliable (see caveats below).

Picking strings out of JavaScript

First, let's look at how to pick out single- and double-quoted substrings from a longer string of pure Javascript code (not HTML). To do this correctly, a regex must match not only both types of quoted strings but also both single-line and multi-line comments. This is because quotes may appear inside comments (e.g. /* I can't take it! */), and these quotes must be ignored. Also, comment delimiters may appear inside quoted strings (e.g. var str = "This: /* can cause trouble too!";), so all four constructs must be parsed out in one pass. Here is a regex which matches both types of comments and both types of quoted strings. It is presented in commented, verbose mode (using PHP single-quoted syntax):

$re = '%# Parse comments and quoted strings from javascript code.
      /\*[^*]*\*+(?:[^*/][^*]*\*+)*/             # A multi-line comment, or
    | (\'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\')  # $1: Single quoted string, or
    | ("[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*")      # $2: Double quoted string, or
    | //.*                                       # A single line comment.
    %x';

This regex captures single-quoted strings into group $1 and double-quoted strings into group $2 (either type of string may contain escaped characters, e.g. 'That\'s cool!'). Comments of both types are matched by the overall match, with neither the $1 nor the $2 capture group participating. Note also that this regex implements Jeffrey Friedl's "unrolling-the-loop" efficiency technique (see Mastering Regular Expressions, 3rd Edition), so it is quite fast.

process_js()
The following Javascript function: process_js(), implements the above regex (in non-verbose, native Javascript RegExp literal syntax). It performs a global (repetitive) replace using an anonymous function which processes the single and double quoted strings independently, and preserves all comments. Two additional functions: process_sq() and process_dq() perform the processing on the matched single and double quoted strings respectively:

function process_js(text) { 
    // Process single and double quoted strings outside comments.
    var re = /\/\*[^*]*\*+(?:[^*\/][^*]*\*+)*\/|('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|\/\/.*/g;
    return text.replace(re,
        function(m0, m1, m2){
            if (m1) return process_sq(m1);  // 'single-quoted'.
            if (m2) return process_dq(m2);  // "double-quoted".
            return m0;                      // preserve comments.
        });
}
function process_sq(text) {
    return text.replace(/\bsomestring\b/g, 'SOMESTRING_SQ');
}
function process_dq(text) {
    return text.replace(/\bsomestring\b/g, 'SOMESTRING_DQ');
}

Note that the two quoted-string handling functions merely replace the keyword somestring with SOMESTRING_SQ and SOMESTRING_DQ respectively, so that the results of the processing are evident. These functions are meant to be modified by the user as needed. Let's see how this performs with a string of Javascript (similar to the example provided in the OP):

Test data input:

// comment with  foo somestring bar  in it
// comment with "foo somestring bar" in it
// comment with 'foo somestring bar' in it
/* comment with  foo somestring bar  in it */
/* comment with "foo somestring bar" in it */
/* comment with 'foo somestring bar' in it */

document.write(" with  foo somestring bar  in it ");
document.write(' with  foo somestring bar  in it ');
document.write(' with "foo somestring bar" in it ');
document.write(" with 'foo somestring bar' in it ");

var str = " with  foo somestring bar  in it ";
var str = ' with  foo somestring bar  in it ';
var str = ' with "foo somestring bar" in it ';
var str = " with 'foo somestring bar' in it ";

Test data output from process_js() :

// comment with  foo somestring bar  in it
// comment with "foo somestring bar" in it
// comment with 'foo somestring bar' in it
/* comment with  foo somestring bar  in it */
/* comment with "foo somestring bar" in it */
/* comment with 'foo somestring bar' in it */

document.write(" with  foo SOMESTRING_DQ bar  in it ");
document.write(' with  foo SOMESTRING_SQ bar  in it ');
document.write(' with "foo SOMESTRING_SQ bar" in it ');
document.write(" with 'foo SOMESTRING_DQ bar' in it ");

var str = " with  foo SOMESTRING_DQ bar  in it ";
var str = ' with  foo SOMESTRING_SQ bar  in it ';
var str = ' with "foo SOMESTRING_SQ bar" in it ';
var str = " with 'foo SOMESTRING_DQ bar' in it ";

Notice that somestring has been processed only within valid Javascript strings and has been ignored within comments. For pure Javascript, this function works pretty darn well!
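
The same regex can also report where each occurrence sits instead of replacing it, which is closer to what the question asks for. What follows is only a minimal sketch and not part of the solution above: find_js_occurrences() and its context labels ('single', 'double', 'comment', 'code') are hypothetical names, and everything that is neither a string nor a comment is lumped together as 'code'. It inherits every limitation of the regex.

```javascript
// Hypothetical helper: list each occurrence of needle in a piece of pure
// Javascript, together with the context the regex assigns to it.
function find_js_occurrences(text, needle) {
    if (!needle) return [];  // an empty needle would never advance
    // Same alternation as process_js(): comments, 'single', "double".
    var re = /\/\*[^*]*\*+(?:[^*\/][^*]*\*+)*\/|('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|\/\/.*/g;
    var results = [];
    var last = 0;
    var m;

    // Record every position of needle inside chunk, which starts at offset.
    function scan(chunk, offset, context) {
        var i = chunk.indexOf(needle);
        while (i !== -1) {
            results.push({ index: offset + i, context: context });
            i = chunk.indexOf(needle, i + needle.length);
        }
    }

    while ((m = re.exec(text)) !== null) {
        scan(text.slice(last, m.index), last, 'code');  // gap before the match
        var context = m[1] ? 'single' : m[2] ? 'double' : 'comment';
        scan(m[0], m.index, context);
        last = re.lastIndex;
    }
    scan(text.slice(last), last, 'code');               // trailing gap
    return results;
}
```

Each entry in the returned array carries the character offset of the hit and one of the four labels, so filtering on context gives the equivalent of the separate searches the question asks for.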

Picking Javascript out of HTML

Parsing Javascript from HTML using regex is not recommended (see caveats below). However, a reasonably good job can be done if you are comfortable using a complex regex and are happy with its limitations (once again, see caveats below). That said, here are the requirements for our regex solution: In HTML, Javascript can occur inside <SCRIPT> elements and within event handler attributes. This solution handles onclick and the other HTML 4.01 mouse and keyboard events (ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown and onkeyup); other handlers such as onload, onsubmit and onchange are not matched. Javascript can also occur within javascript: pseudo-URLs, but IMHO that is really bad practice, so this solution does not attempt to match these. HTML is a complex language, and our scraping regex also needs to skip over comments and CDATA sections so that their contents are left alone. A regex with multiple global alternatives (similar to the previous one) matches each of these structures. Here is the "pluck-js-from-html" regex in commented, verbose mode (once again presented in PHP single-quoted syntax):

$re = '%# (Unreliably) parse out javascript from HTML.
    # Either... Option 1: SCRIPT element.
      (<script\b[^>]*>)         # $1: SCRIPT open tag.
      ([\S\s]*?)                # $2: SCRIPT contents.
      (<\/script\s*>)           # $3: SCRIPT close tag.
    # Or... Options 2 and 3: onclick=quoted-js-code.
    | (                         # $4: onXXXX = 
        \bon                    # HTML 4.01 mouse/key events...
        (?:click|dblclick|mousedown|mouseup|mouseover|
           mousemove|mouseout|keypress|keydown|keyup
        )\s*=\s*                # with = and optional ws.
      )                         # End $4:
      (?:                       # value alternatives are either
        "([^"]*)"               #    $5: Double-quoted-js,
      | \'([^\']*)\'            # or $6: Single-quoted-js.
      )                         # End group of alternatives.
    # Or other HTML stuff that we should not mess with.
    | <!--[\S\s]*?-->           # HTML (non-SGML) comment.
    | <!\[CDATA\[[\S\s]*?\]\]>  # or CDATA section.
    %ix';

In this regex, we capture SCRIPT javascript content in groups: $1 (open tag), $2 (contents) and $3 (closing tag) and onXXX event handler code in groups: $4 (event attribute name), $5 (double-quoted value contents) and $6 (single-quoted value contents). Comments and CDATA sections are captured by the overall match (when none of the capture groups match). Note that this regex does not make use of the "unrolling-the-loop" technique (although it certainly could), because that would add too much complexity for most readers. (All three of the lazy-dot-star expressions i.e. [\S\s]*?, can be unrolled to speed this up.)

process_html()
The following Javascript function: process_html(), implements the above regex (in non-verbose, native Javascript RegExp literal syntax). It performs a global (repetitive) replace using an anonymous function which processes the three different sources of javascript code. It then calls the previously described process_js() function to process the captured js code. Here it is:

function process_html(text) {
    // Pick out javascript from HTML event attributes and SCRIPT elements.
    var re = /(<script\b[^>]*>)([\S\s]*?)(<\/script\s*>)|(\bon(?:click|dblclick|mousedown|mouseup|mouseover|mousemove|mouseout|keypress|keydown|keyup)\s*=\s*)(?:"([^"]*)"|'([^']*)')|<!--[\S\s]*?-->|<!\[CDATA\[[\S\s]*?\]\]>/g;
    // Dispatch on whichever capture groups took part in the match.
    return text.replace(re,
        function(m0, m1, m2, m3, m4, m5, m6) {
            if (m1) { // Case 1: <script> element.
                m2 = process_js(m2);
                return m1 + m2 + m3;
            }
            if (m4) { // Case 2: onXXX event attribute.
                if (m5) {  // Case 2a: double quoted.
                    m5 = process_js(m5);
                    return m4 + '"' + m5 + '"';
                }
                if (m6) {  // Case 2b: single quoted.
                    m6 = process_js(m6);
                    return m4 + "'" + m6 + "'";
                }
            }
            return m0; // Else return other non-js matches unchanged.
        });
}

Test data input:

<script>
/* comment with 'foo somestring bar' in it */
document.write(" with 'foo somestring bar' in it ");
var str = " with 'foo somestring bar' in it ";
</script>

<!-- with  foo somestring bar  in it -->
<!-- with "foo somestring bar" in it -->
<!-- with 'foo somestring bar' in it -->

<![CDATA[ with  foo somestring bar  in it ]]>
<![CDATA[ with "foo somestring bar" in it ]]>
<![CDATA[ with 'foo somestring bar' in it ]]>

<p>non-js with  foo somestring bar  in it non-js</p>
<p>non-js with "foo somestring bar" in it non-js</p>
<p>non-js with 'foo somestring bar' in it non-js</p>

<p onclick="with  foo somestring bar  in it">stuff</p>
<p onclick="with 'foo somestring bar' in it">stuff</p>
<p onclick='with  foo somestring bar  in it'>stuff</p>
<p onclick='with "foo somestring bar" in it'>stuff</p>

Test data output from process_html() :

<script>
/* comment with 'foo somestring bar' in it */
document.write(" with 'foo SOMESTRING_DQ bar' in it ");
var str = " with 'foo SOMESTRING_DQ bar' in it ";
</script>

<!-- with  foo somestring bar  in it -->
<!-- with "foo somestring bar" in it -->
<!-- with 'foo somestring bar' in it -->

<![CDATA[ with  foo somestring bar  in it ]]>
<![CDATA[ with "foo somestring bar" in it ]]>
<![CDATA[ with 'foo somestring bar' in it ]]>

<p>non-js with  foo somestring bar  in it non-js</p>
<p>non-js with "foo somestring bar" in it non-js</p>
<p>non-js with 'foo somestring bar' in it non-js</p>

<p onclick="with  foo somestring bar  in it">stuff</p>
<p onclick="with 'foo SOMESTRING_SQ bar' in it">stuff</p>
<p onclick='with  foo somestring bar  in it'>stuff</p>
<p onclick='with "foo SOMESTRING_DQ bar" in it'>stuff</p>

As you can see, this works pretty darn well and correctly modifies only quoted strings within javascript within HTML.

Caveats: To correctly and reliably extract Javascript from HTML (i.e. to parse it), you must use a parser. Although the above algorithm does a pretty decent job, there are certainly cases where it will fail. For example, the following non-javascript markup will be matched:

<p title="Title with onclick='fake code erroneously matched here!'">stuff</p>
<p title='onclick="alert('> and somestring here too </p><p title=');"'>stuff</p>
<p title='<script>alert("Bad medicine!");</script>'>stuff</p>
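
To see these false positives concretely, here is a small sketch that runs the process_html() regex against two of the lines above and reports what it believes it found. The helper name first_js_match() and the 'script' / 'attribute' / 'skipped' labels are made up for this illustration only.

```javascript
// Same regex as process_html(), without the g flag since only the first
// match is of interest here.
var re = /(<script\b[^>]*>)([\S\s]*?)(<\/script\s*>)|(\bon(?:click|dblclick|mousedown|mouseup|mouseover|mousemove|mouseout|keypress|keydown|keyup)\s*=\s*)(?:"([^"]*)"|'([^']*)')|<!--[\S\s]*?-->|<!\[CDATA\[[\S\s]*?\]\]>/;

// Hypothetical helper: describe the first thing the regex matches.
function first_js_match(html) {
    var m = re.exec(html);
    if (!m) return null;
    if (m[1]) return { kind: 'script', code: m[2] };
    if (m[4]) return { kind: 'attribute', code: m[5] || m[6] || '' };
    return { kind: 'skipped', code: '' };  // HTML comment or CDATA section
}

// Neither input contains real Javascript: both are plain title attributes.
var a = first_js_match('<p title="Title with onclick=\'fake code erroneously matched here!\'">stuff</p>');
console.log(a.kind, a.code);  // attribute fake code erroneously matched here!

var b = first_js_match('<p title=\'<script>alert("Bad medicine!");</script>\'>stuff</p>');
console.log(b.kind, b.code);  // script alert("Bad medicine!");
```

In both cases the regex has no idea that it is looking inside an attribute value, which is exactly the kind of context that only a real HTML parser keeps track of.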

Phew!

ridgerunner
  • Ridge you are ridiculous! Checking this now but +1 anyway for your awesome regex skills (you answered another regex question of mine previously and it worked). – SventoryMang Apr 11 '11 at 14:39

Consider an initial 'parse' (I use the term loosely) which generates three different resultant streams -- one for each of the search domains.

In this stage just increment-step through the file stopping on the tokens /, ' and " as these change the 'context' (possible comment, regex, or string). Then determine (for the / case) and consume the context contents and put it into the appropriate resultant stream. (Finding the end is still a little bit tricky in cases like "foo\"bar\\", but much less tricky than a regex trying to match the contexts in a search.)

When this stage is done -- besides being verifiable -- each of the individual streams can be easily searched independently.
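
A rough sketch of such a scanner, under some loudly stated assumptions: the function name split_js_contexts() and the stream names single, double and other are invented here, comments are simply left in the other stream, and a lone / is treated as ordinary code, so regex literals are NOT recognised (that is the hard part of the / case and is left out).

```javascript
// Hypothetical sketch: split Javascript source into three streams.
function split_js_contexts(text) {
    var streams = { single: [], double: [], other: [] };
    var i = 0, start = 0, n = text.length;

    // Push text[from..to) onto the named stream, skipping empty pieces.
    function emit(name, from, to) {
        if (to > from) streams[name].push(text.slice(from, to));
    }

    while (i < n) {
        var c = text.charAt(i);
        if (c === "'" || c === '"') {
            emit('other', start, i);
            var j = i + 1;
            // Look for the closing quote; a backslash escapes the next char.
            while (j < n && text.charAt(j) !== c) {
                j += (text.charAt(j) === '\\') ? 2 : 1;
            }
            emit(c === "'" ? 'single' : 'double', i + 1, Math.min(j, n));
            i = start = Math.min(j + 1, n);
        } else if (c === '/' && text.charAt(i + 1) === '/') {
            var eol = text.indexOf('\n', i);
            i = (eol === -1) ? n : eol;      // skip ahead; comment stays in 'other'
        } else if (c === '/' && text.charAt(i + 1) === '*') {
            var end = text.indexOf('*/', i + 2);
            i = (end === -1) ? n : end + 2;  // skip ahead; comment stays in 'other'
        } else {
            i++;
        }
    }
    emit('other', start, n);
    return streams;
}
```

Searching streams.single, streams.double and streams.other separately then answers the three cases from the question. Skipping over comments matters because an apostrophe in a comment such as // it's would otherwise be taken for the start of a string.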

Happy coding.


Three regular expressions can't correctly handle this in all cases because JavaScript has no regular lexical grammar: it is not always possible to identify whether a quote starts a string.

Even assuming you can correctly identify and ignore quotes inside comments, quotes inside regular expressions will foil you.

For example,

x++/y - "42" /i

vs

x = ++/y - "42"/i

In the first case, the quotes are part of a string. The first sample is the same as

((x++) / y) - ("42" / i)

but in the second case, the quotes do not delimit a string: they sit inside a regular expression literal. It is the same as

x = ++(new RegExp('y - "42"', 'i'))

which is a syntactically valid, but nonsensical JavaScript statement.

If you're willing to ignore comments and weird constructs like this, then you can match strings using

/"(?:[^"\\]|\\(?:[^\r]|\r\n?))*"/

and

/'(?:[^'\\]|\\(?:[^\r]|\r\n?))*'/

which will match ECMAScript 5 style strings with line continuations.
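
As a quick, hedged illustration of both the strength and the weakness of these two patterns, the sketch below applies them to a literal with an escaped quote and a line continuation, and then to the regex-literal sample from above. The variable names and sample inputs are made up for the example.

```javascript
var dq = /"(?:[^"\\]|\\(?:[^\r]|\r\n?))*"/;
var sq = /'(?:[^'\\]|\\(?:[^\r]|\r\n?))*'/;

// A double-quoted literal containing escaped quotes and a line
// continuation (a backslash immediately followed by a newline).
var source = 'var s = "say \\"hi\\" \\\ncontinued";';
var match = dq.exec(source);
// match[0] is the whole literal, from the opening quote to the closing one.

// The weird construct from above: here "42" is part of a regular
// expression literal, yet the pattern still reports it as a string.
var fooled = dq.exec('x = ++/y - "42"/i');
console.log(fooled[0]);  // "42"
```

The second result is the false positive described earlier: the pattern sees a pair of quotes and cannot know that they belong to a regex literal.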

Mike Samuel