
I'm working on a file parser that needs to strip comments from JavaScript code. It has to be smart enough not to treat a '//' sequence inside a string as the beginning of a comment. I have the following idea for doing it:

Iterate through the lines. Find the '//' sequence first, then find all strings surrounded by quotes (' or ") in the line, and then iterate through all the string matches to check whether the '//' sequence is inside or outside one of those strings. If it is outside all of them, it must be the beginning of a proper comment.

When testing the code on the following line (part of a bigger JS file, of course):

document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";

I encountered a problem. My regular expression code:

re_strings=re.compile("""   "
                            (?:
                            \\.|
                            [^\\"]
                            )*
                            "
                            |
                            '
                            (?:
                                [^\\']|
                                \\.
                            )*
                            '
                            """,re.VERBOSE);


for s in re.finditer(re_strings,line):
    print(s.group(0))

In Python 3.2.3 (and 3.1.4) this returns the following strings:

"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"

Which is obviously wrong, because \" should not end the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So I used RegexBuddy (with Python compatibility) and the Python regex tester at http://re-try.appspot.com/ for reference. The most peculiar thing is that they both return the same, correct results, which differ from my code's output:

"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"

My question is: what is the cause of those differences? What have I overlooked? I'm rather a beginner in both Python and regular expressions, so maybe the answer is simple...

P.S. I know that checking whether the '//' sequence is inside string quotes can be accomplished with one bigger regex. I've already tried that and ran into the same problem.

P.P.S. I would like to know what I'm doing wrong and why my code behaves differently from the regex test applications, not to find other ideas for parsing JavaScript code.

Wookie88

2 Answers


You just need to use a raw string to create the regex:

re_strings=re.compile(r"""   "
                             etc.
                             "
                        """,re.VERBOSE);

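The way you've got it, Python processes the backslash escapes in the string literal before the regex engine ever sees the pattern, so \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.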

See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
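For readers without access to the linked demo, here is a minimal sketch of the same point. It compiles only the double-quoted branch of the question's pattern, once from a plain string literal and once from a raw one, and runs both against the line from the question. The variable names `plain` and `raw` are just for illustration.

```python
import re

# The JavaScript line from the question. The raw string keeps its backslashes.
line = r'''document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";'''

# Plain literal: Python turns \\ into a single backslash first, so the regex
# engine receives  \.|[^\"]  (a literal dot, or any character except ").
plain = re.compile("""   "
                         (?:
                         \\.|
                         [^\\"]
                         )*
                         "
                   """, re.VERBOSE)

# Raw literal: the regex engine receives  \\.|[^\\"]  (a backslash followed by
# any character, or any character except a backslash or ").
raw = re.compile(r"""   "
                        (?:
                        \\.|
                        [^\\"]
                        )*
                        "
                 """, re.VERBOSE)

print(plain.pattern == raw.pattern)  # False: the two patterns really differ

for s in plain.finditer(line):
    print("plain:", s.group(0))

for s in raw.finditer(line):
    print("raw:  ", s.group(0))
```

The plain version should reproduce the seven short matches from the question, and the raw version the four matches reported by the regex testers.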

Alan Moore
  • Wow, that's really a simple mistake that I made. I thought that a triple-quoted string was already a raw string. Thanks for the link; this online tool could be pretty handy. – Wookie88 Aug 31 '12 at 09:06

You cannot deal with matching quotes with regex... in fact, you cannot guarantee any matching pairs of anything (nested pairs especially)... you need a more sophisticated state machine for that (LLVM, etc.).

source: lots of CS classes...

Also see "Matching pair tag with regex" for a more detailed explanation.

I know it's not what you wanted to hear, but that's basically just the way it is... and yes, different regex implementations can return different results for things that regex can't really do.

Joran Beasley
  • I've also attended some CS classes, but maybe I forgot something. As far as I know, a regex is a state machine, and when it meets OR it tries to match the branches in left-to-right order (that is in Python; I agree that it CAN vary between languages). So when you are looking for a single quoted string you can write something like `"(\\.|[^\\"])*"` and it should work. I don't see any room for interpreting this regex in more than one way: inside the string, if you meet `\ ` it must be followed by some other character, so if that character is `"`, we are still in `(...)*`, which must be ended with `"`. Please correct me if I'm wrong. – Wookie88 Aug 30 '12 at 23:19
  • I dunno, I think you are right... but all I remember from class about this was that you absolutely cannot match nested stuff (quotes were the main example)... it was drilled into us pretty hard... – Joran Beasley Aug 30 '12 at 23:28
  • Actually, you can say "one FOO", "followed by 0 or more of NO FOO", "followed by one FOO". But you cannot deal with nested things, dealing with escaped FOOs gets really hairy, and what the question's asker actually wants is a full-fledged parser. – Vatine Aug 31 '12 at 00:46
  • Yeah, nesting with regex is quite hard and can take lots of resources. I previously ran into a problem where Python couldn't finish the findall method for the whole file, so I started iterating through lines and used a non-regex method for telling whether ` // ` is outside quotes. – Wookie88 Aug 31 '12 at 09:09
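The non-regex method mentioned in the last comment was not posted, so the following is only a guess at what such a check could look like: a small character scanner that tracks whether it is currently inside a quoted string. The function name `find_comment_start` is hypothetical. The sketch assumes strings do not span lines, and it ignores `/* ... */` comments, regex literals and template literals, so it is not a full JavaScript parser.

```python
def find_comment_start(line):
    """Return the index of the first '//' outside a quoted string, or -1."""
    quote = None  # quote character of the string we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote:
                quote = None  # closing quote: back outside the string
        elif ch in "'\"":
            quote = ch  # opening quote: remember which kind
        elif line.startswith("//", i):
            return i  # '//' seen while outside any string
        i += 1
    return -1


sample = r'var u = "http://example.com/\"x\""; // trailing comment'
pos = find_comment_start(sample)
print(sample[:pos].rstrip() if pos != -1 else sample)
```

Because the scanner makes a single left-to-right pass over each line, its running time is linear in the line length, which avoids the kind of slowdown described for findall on a whole file.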