
I'm trying to catch as many JavaScript redirects as possible across many HTML pages. My regular expression is:

((location.href)|(window.location)|(location.replace)|(location.assign))(( ?= ?)|( ?\( ?))("|')([^'"]*)("|')( ?\) ?)?;

I use Python but the question is general:

import re

regex = re.compile(r"""((location.href)|(window.location)|(location.replace)|(location.assign))(( ?= ?)|( ?\( ?))("|')([^'"]*)("|')( ?\) ?)?;""", re.I)
# ... some control here ...
print(regex.search(html).group(10))  # group 10 is the bare URL

I did some tests and I was able to catch all these cases.

location.href = "http://www.foo.com";
location.href="http://www.foo.com";
window.location = "http://www.foo.com";
window.location.href = "http://www.foo.com";
location.replace ("http://www.foo.com");
location.replace( "http://www.foo.com" ) ;
location.assign ("http://www.foo.com");

And it skips cases where I can't resolve a URL because the code contains a variable:

location.href = "http://www.foo.com" + var + "something else";
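The tests above can be reproduced with a small harness. This is only a sketch: the pattern is the one from the question, unchanged, while the names `REDIRECT_RE` and `find_redirect` are made up for illustration.

```python
import re

# The pattern from the question, unchanged; group 10 holds the bare URL.
REDIRECT_RE = re.compile(
    r"""((location.href)|(window.location)|(location.replace)|(location.assign))(( ?= ?)|( ?\( ?))("|')([^'"]*)("|')( ?\) ?)?;""",
    re.I,
)


def find_redirect(js):
    """Return the first redirect URL found in js, or None if nothing matches."""
    match = REDIRECT_RE.search(js)
    return match.group(10) if match else None


caught = [
    'location.href = "http://www.foo.com";',
    'location.href="http://www.foo.com";',
    'window.location = "http://www.foo.com";',
    'window.location.href = "http://www.foo.com";',
    'location.replace ("http://www.foo.com");',
    'location.replace( "http://www.foo.com" ) ;',
    'location.assign ("http://www.foo.com");',
]
skipped = 'location.href = "http://www.foo.com" + var + "something else";'

for line in caught + [skipped]:
    print(line, "->", find_redirect(line))
```

Returning `None` instead of calling `.group(10)` directly avoids an `AttributeError` on pages that contain no redirect at all.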

The questions are:

  1. Are there other ways to redirect using JavaScript? Is there some other `location.somethingelse` that I am missing?
  2. Is the way I catch these 4 cases correct? Is it valid to write something like `location.href = http://www.foo.com;` or `location.replace (http://www.foo.com);`, which I'd miss because I require quotes? Am I too strict or too lax?
  3. Is my regex well written? Or can I improve it in some way?
dda
  • There's also `document.location` – mrk Nov 13 '12 at 15:09
  • The URL should be between double quotes OR single quotes. You'll need to check for both since they are both valid ways to enclose a string in JavaScript. – Matt Burland Nov 13 '12 at 15:15
  • @MattBurland, yes, I think I check that with the `("|')` part before and after the url. Am I doing this wrongly? –  Nov 13 '12 at 15:21
  • @Luca: No, you are correct. I missed the `("|')` part in your regex. – Matt Burland Nov 13 '12 at 16:35
  • I had just started the task trying to write regex to do this for my script. Looked everywhere but found your post. Thank you for saving my time, I'll let you know if I think of anything else. – John Z Mar 02 '17 at 19:44
  • I did notice that one thing you do not handle is meta http-equiv refreshes. I understand it's an old post, I just wanted to mention it! – John Z Mar 02 '17 at 20:24
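
The meta refresh case raised in the last comment could be covered by a separate pass over the HTML. The following is a rough sketch, not a robust parser: the names `META_REFRESH_RE`, `URL_RE` and `find_meta_refresh` are hypothetical, and the patterns assume the tag contains no `>` inside an attribute value.

```python
import re

# A <meta> tag whose http-equiv attribute is "refresh" (quotes optional).
META_REFRESH_RE = re.compile(
    r"""<meta\b[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>""",
    re.I,
)
# The url=... part inside the content attribute, with an optional inner quote.
URL_RE = re.compile(r"""url\s*=\s*["']?([^"'>\s;]+)""", re.I)


def find_meta_refresh(html):
    """Return the target URL of the first meta refresh tag, or None."""
    for tag in META_REFRESH_RE.finditer(html):
        url = URL_RE.search(tag.group(0))
        if url:
            return url.group(1)
    return None


print(find_meta_refresh(
    '<meta http-equiv="refresh" content="5; url=http://www.foo.com/">'
))
```

A refresh tag that only reloads the current page (for example `content="5"` with no `url=` part) yields `None`, which seems like the right behaviour when only redirects to another address are of interest.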

1 Answer


In general, you cannot parse programming languages with regexes (well, theoretically you can, but it's extremely impractical). This is especially true for JavaScript because of its highly dynamic nature. For example,

 window['loc' + 'a' + 'tion'][['h','r','e','f'].join('')] = 'something'.replace(/s/, etc...)

That said, here's an expression that at least passes your tests (broken down for clarity):

import re

# quoted string (named `quoted` so it doesn't shadow the built-in `str`)
quoted = r"""
    ' (?:\\.|[^'])* '
    |
    " (?:\\.|[^"])* "
"""
# dotted reference to "location"
loc = r"""
    (?: \w+\.)*
    \b location \b
    (?: \.\w+)*
"""

# ref=string or ref(string)
expr = r"""
    ({0})
    \s*
    (?:
        = \s* ({1})
        |
        \( \s* ({1}) \s* \)
    )
    \s*
    ;
""".format(loc, quoted)

Compile this in extended mode, e.g.

expr = re.compile(expr, re.X)
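
As a usage sketch, the compiled expression can be wrapped in a small helper. The helper name `extract_redirect` is hypothetical, and the string pattern is called `quoted` here to avoid shadowing the built-in `str`. Note that the quoted string lands in group 2 for the assignment form and in group 3 for the call form, and that the match still includes the surrounding quotes.

```python
import re

quoted = r"""
    ' (?:\\.|[^'])* '
    |
    " (?:\\.|[^"])* "
"""
loc = r"""
    (?: \w+\.)*
    \b location \b
    (?: \.\w+)*
"""
expr = re.compile(r"""
    ({0})
    \s*
    (?:
        = \s* ({1})
        |
        \( \s* ({1}) \s* \)
    )
    \s*
    ;
""".format(loc, quoted), re.X)


def extract_redirect(js):
    """Return (reference, url) for the first match in js, or None."""
    m = expr.search(js)
    if m is None:
        return None
    literal = m.group(2) or m.group(3)  # assignment form or call form
    return m.group(1), literal[1:-1]    # strip the surrounding quotes


print(extract_redirect('window.location.href = "http://www.foo.com";'))
print(extract_redirect('location.replace( "http://www.foo.com" ) ;'))
print(extract_redirect("window['loc' + 'ation']['href'] = 'http://www.foo.com';"))
```

The last call illustrates the caveat from the start of this answer: once the property name is built dynamically, the literal word `location` never appears and the expression finds nothing.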
georg
  • I'm not interested in this type of expedient, but thank you. I'm doing an extra check: if a webmaster does this in his pages, he deserves to be excluded from my list :) –  Nov 13 '12 at 15:23