3

I'm trying to write a regex that finds all variables (and only variables, ignoring methods completely) in a given piece of JavaScript code. The code that actually runs the regex is written in Java.

For now, I've got something like this:

Matcher matcher=Pattern.compile(".*?([a-z]+\\w*?).*?").matcher(string);
while(matcher.find()) {
    System.out.println(matcher.group(1));
}

So, when the value of "string" is variable*func()*20

printout is:

variable
func

That is not what I want. Simply negating ( won't do: it makes the regex either catch unnecessary characters or cut characters off, and functions are still captured. For now, I have the following code:

Matcher matcher=Pattern.compile(".*?(([a-z]+\\w*)(\\(?)).*?").matcher(formula);
while(matcher.find()) {
    if(matcher.group(3).isEmpty()) {
        System.out.println(matcher.group(2));
    }
}

It works, the printout is correct, but I don't like the additional check. Any ideas? Please?

EDIT (2011-04-12):

Thank you for all the answers. There were questions about why I would need something like this. And you are right: for bigger, more complicated scripts, the only sane solution would be to parse them. In my case, however, that would be excessive. The scraps of JS I'm working on are intended to be simple formulas, something like (a+b)/2. No comments, string literals, arrays, etc. Only variables and (probably) some built-in functions. I need the list of variables to check whether they can be initialized at this point (and whether they are initialized at all). I realize that all of this could be done manually with RPN as well (which would be safer), but these formulas are going to be wrapped in a bigger script and evaluated in a web browser, so it's more convenient this way.

This may be a bit dirty, but it's assumed that whoever writes these formulas (probably me, most of the time) knows what they are doing and is able to check that the formulas work correctly.

Anyone who finds this question and wants to do something similar should know the risks and difficulties. I do, at least I hope so ;)

genobis
  • 1,081
  • 9
  • 13
  • 1
    It's not exactly the same; however, take a look at this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Regular expressions are not a panacea! – Serafeim Apr 11 '11 at 13:21

4 Answers

1

It's pretty well established that regex cannot be reliably used to parse structured input. For the famous response on this, see the question "RegEx match open tags except XHTML self-contained tags".

As any given sequence of characters may or may not change meaning depending on previous or subsequent sequences of characters, you cannot reliably identify a syntactic element without both lexing and parsing the input text. Regex can be used for the former (breaking an input stream into tokens), but cannot be used reliably for the latter (assigning meaning to tokens depending on their position in the stream).
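To illustrate the lexing half of that distinction, here is a minimal sketch of a regex-based tokenizer for simple formulas. The class name FormulaLexer, the method tokenize and the token labels (IDENT, NUMBER, OP) are made up for this example, and it assumes input with no strings or comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: splits a simple formula into tokens.
class FormulaLexer {
    // One alternative per token kind; named groups tell us which one matched.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<ident>[A-Za-z_]\\w*)"        // identifier
            + "|(?<number>\\d+(?:\\.\\d+)?)" // integer or decimal literal
            + "|(?<op>[-+*/(),])");          // single-character operator or bracket

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // find() silently skips anything that is not a token (e.g. whitespace).
        while (m.find()) {
            if (m.group("ident") != null) {
                tokens.add("IDENT:" + m.group());
            } else if (m.group("number") != null) {
                tokens.add("NUMBER:" + m.group());
            } else {
                tokens.add("OP:" + m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("variable*func()*20"));
    }
}
```

For variable*func()*20 this yields IDENT:variable, OP:*, IDENT:func, OP:(, OP:), OP:*, NUMBER:20. Both variable and func come out as plain IDENT tokens; deciding that func is a call means looking at the token that follows it, which is the parser's job rather than the lexer's.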

Community
  • 1
  • 1
AndyT
  • 1,413
  • 9
  • 11
  • I haven't seen that, the best response ever! And yes, I think you are right. Other posters mentioned some parsers, but I think it would be a bit excessive in my case... – genobis Apr 12 '11 at 08:49
1

If you are rethinking the use of regex and wondering what else you could do, you could consider using an AST instead to access your source programmatically. This answer shows how you could use the Eclipse Java AST to build a syntax tree for Java source. I guess you could do something similar for JavaScript.

Community
  • 1
  • 1
ewan.chalmers
  • 16,145
  • 43
  • 60
  • I've googled a bit and I see that I could, indeed ( http://help.eclipse.org/helios/index.jsp?topic=/org.eclipse.wst.jsdt.doc/reference/api/org/eclipse/wst/jsdt/core/dom/AST.html ). This might be too heavy for my current needs, but thank you - I didn't know that and, knowing myself, I'll need it sooner rather than later :) – genobis Apr 12 '11 at 08:39
1

A regex won't cut it in this case because Java isn't regular. Your best bet is to get a parser that understands Java syntax and build on that. Luckily, ANTLR has a Java 1.6 grammar (and a 1.5 grammar).

For your rather limited use case you could probably extend the variable assignment rules fairly easily and get the info you need. There's a bit of a learning curve, but this will probably be your best bet for a quick and accurate solution.

Andrew White
  • 52,720
  • 19
  • 113
  • 137
  • The original question was for JavaScript, not java. ANTLR also has a JS grammar (here: http://www.antlr.org/grammar/1206736738015/JavaScript.g). As always though, when the answer starts getting extremely complex, I suggest that the developer asks themselves if they'd asked the right question in the first place. @genobis - Why do you need to do this? – AndyT Apr 11 '11 at 14:06
  • @genobis - See my answer for the reasons why Regex won't work. – AndyT Apr 11 '11 at 16:46
1

All the sound advice about regex not being the best tool for the job is worth taking into consideration. But you might get away with a quick and dirty regex if your rule is simple enough (and you are aware of the limitations of that rule):

Pattern regex = Pattern.compile(
    "\\b     # word boundary\n" +
    "[A-Za-z]# 1 ASCII letter\n" +
    "\\w*    # 0+ word characters (alnum or _)\n" +
    "\\b     # word boundary\n" +
    "(?!     # Lookahead assertion: Make sure there is no...\n" +
    " \\s*   # optional whitespace\n" +
    " \\(    # opening parenthesis\n" +
    ")       # ...at this position in the string", 
    Pattern.COMMENTS);

This matches an identifier as long as it's not followed by a parenthesis. Of course, now you need group(0) instead of group(1). And of course this matches lots of other stuff (inside strings, comments, etc.)...
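As a usage sketch, the same pattern can be written on one line (without Pattern.COMMENTS) and wrapped in a small helper. The class name VariableFinder and the method findVariables are invented for this example, and it assumes simple formulas like the ones described in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects identifiers that are not followed by "(".
class VariableFinder {
    // Identifier, then a negative lookahead for optional whitespace plus "(".
    private static final Pattern IDENTIFIER =
            Pattern.compile("\\b[A-Za-z]\\w*\\b(?!\\s*\\()");

    static List<String> findVariables(String formula) {
        List<String> names = new ArrayList<>();
        Matcher matcher = IDENTIFIER.matcher(formula);
        while (matcher.find()) {
            names.add(matcher.group()); // group() is the whole match, i.e. group(0)
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findVariables("variable*func()*20")); // prints [variable]
        System.out.println(findVariables("(a+b)/2"));            // prints [a, b]
    }
}
```

One limitation worth knowing: in something like Math.sqrt(x), the name Math is followed by a dot rather than a parenthesis, so it would be reported as a variable alongside x.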

Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • Thank you very much! It's exactly what I need, and your answer not only solves my problem, but is very informative. I feel a bit wiser now :) It will catch other stuff, but - as I stated in updated question - it's acceptable in my case. – genobis Apr 12 '11 at 08:44