
I'm trying to detect valid Java annotations in a text. Here's my test program (I'm currently ignoring all whitespace for simplicity; I'll add that later):

import re

txts = ['@SomeName2',                   # match
        '@SomeName2(',                  # no match
        '@SomeName2)',                  # no match 
        '@SomeName2()',                 # match
        '@SomeName2()()',               # no match
        '@SomeName2(value)',            # no match
        '@SomeName2(=)',                # no match
        '@SomeName2("")',               # match
        '@SomeName2(".")',              # no match
        '@SomeName2(",")',              # match
        '@SomeName2(value=)',           # no match
        '@SomeName2(value=")',          # no match
        '@SomeName2(=3)',               # no match
        '@SomeName2(="")',              # no match
        '@SomeName2(value=3)',          # match
        '@SomeName2(value=3L)',         # match
        '@SomeName2(value="")',         # match
        '@SomeName2(value=true)',       # match
        '@SomeName2(value=false)',      # match
        '@SomeName2(value=".")',        # no match
        '@SomeName2(value=",")',        # match
        '@SomeName2(x="o_nbr ASC, a")', # match

        # multiple params:
        '@SomeName2(,value="ord_nbr ASC, name")',                            # no match
        '@SomeName2(value="ord_nbr ASC, name",)',                            # no match
        '@SomeName2(value="ord_nbr ASC, name"insertable=false)',             # no match
        '@SomeName2(value="ord_nbr ASC, name",insertable=false)',            # match
        '@SomeName2(value="ord_nbr ASC, name",insertable=false,length=10L)', # match

        '@SomeName2 ( "ord_nbr ASC, name", insertable = false, length = 10L )',       # match
       ]


#regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?\))?$'
#regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?(,((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))*\))?$'

regex = r"""
    (?:@[a-z]\w*)                               # @ + identifier (class name)
    (
      \(                                        # opening parenthesis
        (
          (?:[a-z]\w*)                          # identifier (var name)
          =                                     # assignment operator
          (\d+l?|"(?:[a-z0-9_, ]*)"|true|false) # either a numeric | a quoted string containing only alphanumeric chars, _, space | true | false
        )?                                      # optional assignment group
      \)                                        # closing parenthesis
    )?$                                         # optional parentheses group (zero or one)
    """


rg = re.compile(regex, re.VERBOSE + re.IGNORECASE)

for txt in txts:
    m = rg.search(txt)
    #m = rg.match(txt)
    if m:
        print "MATCH:   ",
        output = ''
        for i in xrange(2):
            output = output + '[' + str(m.group(i+1)) + ']'
        print output
    else:
        print "NO MATCH: " + txt

So basically what I have seems to work for zero or one parameters. Now I'm trying to extend the syntax to zero or more parameters, like in the last example.

I then copied the part of the regex that represents the assignment and prepended a comma to it for the 2nd to nth group (this group now uses * instead of ?):

regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?(,((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))*\))?$'

That cannot work, however. The problem seems to be how to handle the first element: because it must be optional, strings like the first extension example '@SomeName2(,value="ord_nbr ASC, name")' would be accepted, which is wrong. I have no idea how to make the 2nd to nth assignment depend on the presence of the first (optional) element.

Can it be done? Is it done that way? How do you best solve this?

Thanks

Kawu
  • (regex part that represents the assignment)+ Will the above work for you? '+' indicates one or more matches, making the 2nd to Nth optional and dependent on the first. – subiet Apr 09 '12 at 09:54
  • Sidenote: don't make a habit of using `re.IGNORECASE`. It's slower (not such a big deal) and is horrible when used with Unicode (a big deal.) – Li-aung Yip Apr 09 '12 at 11:37
  • Are you trying to detect any valid Java annotation, or a restricted subset? I can think of several perfectly valid Java annotations that your regex doesn't handle. – Luke Woodward Apr 09 '12 at 11:40
  • Just a subset! Nested annotations of course are a problem. It's about simple, non-recursing annotations. I need to check validity and extract the passed key-value pairs. – Kawu Apr 09 '12 at 11:47
  • Sidenote 2: The "**best** way to solve this" is to look for an *existing* parser library that already handles all the syntax quirks of Java annotations. Implementing parsers where pre-written parsers already exist is a losing game. (Even for simple formats like CSV files there are more corner cases than you would expect, hence the `csv` module.) – Li-aung Yip Apr 09 '12 at 11:49
  • I agree with @Li-aungYip. You're really pushing the limits of regex with a task like this. You need a parser. – alan Apr 09 '12 at 14:29

2 Answers


If you're just trying to detect valid syntax, I believe the regex below will give you the matches you want. But I'm not sure what you are doing with the groups. Do you want each parameter value in its own group as well? That will be harder, and I'm not sure it's even possible with regex.

regex = r'((?:@[a-z][a-z0-9_]*))(?:\((?!,)(?:(([a-z][a-z0-9_]*(=)(?:("[a-z0-9_, ]*")|(true|false)|(\d+l?))))(?!,\)),?)*\)(?!\()|$)'

If you need the individual parameters/values, you probably need to write a real parser for that.

EDIT: Here's a commented version. I also removed many of the capturing and non-capturing groups to make it easier to understand. If you use this with re.findall() it will return two groups: the function name, and all the params in parentheses:

regex = r'''
(@[a-z][a-z0-9_]*) # function name, captured in group
(                  # open capture group for all parameters
\(                 # opening function parenthesis 
  (?!,)            # negative lookahead for unwanted comma
  (?:              # open non-capturing group for all params
  [a-z][a-z0-9_]*  # parameter name
  =                # parameter assignment operator
  (?:"[a-z0-9_, ]*"|true|false|(?:\d+l?)) # possible parameter values
  (?!,\))          # negative lookahead for unwanted comma and closing parenthesis
  ,?               # optional comma, separating params
  )*               # close param non-capturing group, make it optional
\)                 # closing function parenthesis 
(?!\(\))           # negative lookahead for empty parentheses
|$                 # OR end-of-line (in case there are no params)
)                  # close capture group for all parameters
'''
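As a rough usage sketch, the commented pattern can be compiled with re.VERBOSE and re.IGNORECASE and handed to findall(). The sample strings and the names `annotation_re`, `samples` and `results` are made up for illustration.

```python
import re

# The commented pattern from above, compiled with the two flags it relies on:
# VERBOSE for the layout and comments, IGNORECASE because the classes are lower-case only.
annotation_re = re.compile(r'''
(@[a-z][a-z0-9_]*) # function name, captured in group 1
(                  # group 2: everything in parentheses
\(
  (?!,)            # no leading comma
  (?:
  [a-z][a-z0-9_]*  # parameter name
  =
  (?:"[a-z0-9_, ]*"|true|false|(?:\d+l?)) # possible parameter values
  (?!,\))          # no trailing comma before the closing parenthesis
  ,?
  )*
\)
(?!\(\))
|$                 # OR end of line when there are no parentheses at all
)
''', re.VERBOSE | re.IGNORECASE)

# Illustrative inputs only.
samples = [
    '@SomeName2',
    '@SomeName2(value=3)',
    '@SomeName2(value="a, b",insertable=false)',
    '@SomeName2(,value=3)',
    '@SomeName2(value=3,)',
]

# findall() returns one (name, params) tuple per annotation it finds.
results = {s: annotation_re.findall(s) for s in samples}
for sample, found in results.items():
    print(sample, '->', found)
```

An annotation without parentheses yields an empty second group, and the inputs with a leading or trailing comma yield no match at all. Because the separating comma is optional in this pattern, a list with a missing comma between two parameters is still accepted.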

After reading your comment about the parameters, the easiest thing will probably be to use the above regex to pull out all the parameters, then write another regex to pull out name/value pairs to do with as you wish. This will be tricky too, though, because there are commas in the parameter values. I'll leave that as an exercise for the reader :)
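One possible way to approach that exercise, sketched under the assumption that the parameter string has already been validated by the regex above: the value alternatives only allow commas inside double quotes, so findall() with a key=value pattern consumes each quoted string whole and never splits on the commas within it. The names `pair_re` and `extract_pairs` are hypothetical.

```python
import re

# Hypothetical second-stage pattern: one key=value pair.
# It reuses the same value alternatives as the validating regex.
pair_re = re.compile(r'''
    ([a-z][a-z0-9_]*)                   # key, captured in group 1
    =
    ("[a-z0-9_, ]*"|true|false|\d+l?)   # value, captured in group 2
''', re.VERBOSE | re.IGNORECASE)


def extract_pairs(params):
    """Return a list of (key, value) tuples from a validated parameter string."""
    return pair_re.findall(params)


pairs = extract_pairs('(value="ord_nbr ASC, name",insertable=false,length=10L)')
for key, value in pairs:
    print(key, '=>', value)
```

String values keep their surrounding quotes here; stripping them, or turning each tuple into something like the `KeyValue` object mentioned in the comments, would be a separate step.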

alan
  • This seems to be working, thanks. Though I have trouble fully understanding it. The groups... I have no idea why I output them actually, they are a leftover of the example I started with. The only thing I need to do is detect which annotations are correct and extract the assignments (key-value pairs) from them (instantiate class `KeyValue("insertable", "true")` per pair). Note, that for the first arg/param it should be possible to not specify the key at all: `@BlaBla("one_two, three", insertable=true)` should MATCH in reality, too, but I tried to keep my question as simple as possible. – Kawu Apr 09 '12 at 11:18
  • Note that `'@SomeName2(value="ord_nbr ASC, name"insertable=false)'` matches, but shouldn't. (Forgot this as an example, sorry! It's basically about comma-separated lists as you know them from any language...) -> Question updated – Kawu Apr 09 '12 at 11:27
  • `re.VERBOSE`, people! Especially important to instructional purposes. ;) – Li-aung Yip Apr 09 '12 at 11:28
  • On [some more research](http://stackoverflow.com/questions/464736/python-regular-expressions-how-to-capture-multiple-groups-from-a-wildcard-expr), it's certainly possible to pull out the individual key-value pairs, using `re.findall()`. – Li-aung Yip Apr 09 '12 at 11:58
  • @Li-aungYip: I had every intention of coming back and commenting the regex, but had to leave quickly, and decided to go ahead and post it to see if it solved the problem. – alan Apr 09 '12 at 12:59
  • Well you did get in a full half-hour earlier than I did with a solution that seems to work, so points for that. ;) – Li-aung Yip Apr 09 '12 at 13:01
  • @Alan: thanks for the update. The next exercise for the "reader" will be to find out how these lookahead things are working... ;-) anyone have a recommendation on what to read BTW? – Kawu Apr 09 '12 at 14:18
  • @Kawu: my favourite recommendation for regex syntax is Chapter 8 of the TextWrangler user manual. (It's a text editor for Mac, but its manual happens to have a very good section on PCRE-compatible regex.) – Li-aung Yip Apr 09 '12 at 14:42

Use the re.VERBOSE flag

You've done some funny things here. Here's your original regex:

regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?\))?$'

For starters, use the re.VERBOSE flag so you can break this across multiple lines. This way whitespace and comments in the regular expression do not affect its meaning, so you can document what the regular expression is trying to do.

regex = re.compile(r"""
((?:@[a-z][a-z0-9_]*))     # Match starting symbol, @-sign followed by a word
(\(
    (((?:[a-z][a-z0-9_]*))                     # Match arguments??
    (=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))? # ?????
\))?$
""", re.VERBOSE + re.IGNORECASE)

Since you haven't documented what this regex is trying to do, I can't decompose it any further. Document the intent of any non-trivial regular expression by using re.VERBOSE, breaking it across multiple lines, and commenting it.


Break the problem into manageable parts

Your regex is quite hard to understand because it's trying to do too much. As it stands, your regex is trying to do two things:

  1. Match a symbol name of the form @SomeSymbol2, optionally followed by a parenthesised list of arguments, (arg1="val1",arg2="val2"...)
  2. Validate the contents of the parenthesised argument list, so that (arg1="val1",arg2="val2") passes but (232,211) doesn't.

I would suggest breaking this into two parts, as below:

import re
import pprint

txts = [
        '@SomeName2',              # match
        '@SomeName2(',             # no match
        '@SomeName2)',             # no match 
        '@SomeName2()',            # match
        '@SomeName2()()',          # no match
        '@SomeName2(value)',       # no match
        '@SomeName2(=)',           # no match
        '@SomeName2("")',          # no match
        '@SomeName2(value=)',      # no match
        '@SomeName2(value=")',     # no match
        '@SomeName2(=3)',          # no match
        '@SomeName2(="")',         # no match
        '@SomeName2(value=3)',     # match
        '@SomeName2(value=3L)',    # match
        '@SomeName2(value="")',    # match
        '@SomeName2(value=true)',  # match
        '@SomeName2(value=false)', # match
        '@SomeName2(value=".")',   # no match
        '@SomeName2(value=",")',   # match
        '@SomeName2(value="ord_nbr ASC, name")', # match

        # extension needed!:
        '@SomeName2(,value="ord_nbr ASC, name")', # no match
        '@SomeName2(value="ord_nbr ASC, name",)', # no match
        '@SomeName2(value="ord_nbr ASC, name",insertable=false)'
        ] # no match YET, but should

# Regular expression to match overall @symbolname(parenthesised stuff)
regex_1 = re.compile( r"""
^                   # Start of string
(@[a-zA-Z]\w*)      # Matches initial token. Token name must start with a letter.
                    # Subsequent characters can be any of those matched by \w, being [a-zA-Z0-9_]
                    # Note behaviour of \w is LOCALE dependent.
( \( [^)]* \) )?    # Optionally, match parenthesised part containing zero or more characters
$                   # End of string
""", re.VERBOSE)

#Regular expression to validate contents of parentheses
regex_2 = re.compile( r"""
^
(
    ([a-zA-Z]\w*)       # argument key name (i.e. 'value' in the examples above)
    =                   # literal equals symbol
    (                   # acceptable arguments are:
        true  |         # literal "true"
        false |         # literal "false"
        \d+L? |         # integer (optionally followed by an 'L')
        "[^"]*"         # string (may not contain quote marks!)
    )
    \s*,?\s*            # optional comma and whitespace
)*                      # Match this entire regex zero or more times
$
""", re.VERBOSE)

for line in txts:
    print("\n")
    print(line)
    m1 = regex_1.search(line)    

    if m1:
        annotation_name, annotation_args = m1.groups()

        print "Symbol name   : ", annotation_name
        print "Argument list : ", annotation_args

        if annotation_args:
            s2 = annotation_args.strip("()")
            m2 = regex_2.search(s2)
            if (m2):
                pprint.pprint(m2.groups())
                print "MATCH"
            else:
                print "MATCH FAILED: regex_2 didn't match. Contents of parentheses were invalid."
        else:
            print "MATCH"

    else:
        print "MATCH FAILED: regex_1 didn't match."

This nearly gets you to a final solution. The only corner case I can see is that this (incorrectly) matches a trailing comma in the argument list. (You can check for this using a simple string operation, str.endswith().)
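A sketch of one way to close that gap inside the regex itself rather than with a separate string check: require one pair first, then repeat a "comma plus pair" group, and make the whole construct optional. `PAIR`, `strict_args` and `args_are_valid` are illustrative names, not part of the code above, and tolerating whitespace around `=` and `,` is an assumption.

```python
import re

# One key=value pair, reused for the first argument and for every later one.
PAIR = r'''
    [a-zA-Z]\w*                            # key
    \s*=\s*
    (?: true | false | \d+L? | "[^"]*" )   # value
'''

# Pattern shape: ( pair ( , pair )* )?
# A later pair can only appear after a comma, and a comma can only appear
# between two pairs, so leading, trailing and missing commas all fail.
strict_args = re.compile(r'''
    ^\s*
    (?:                           # the whole list is optional
        %(pair)s
        (?: \s*,\s* %(pair)s )*   # every further pair needs a leading comma
    )?
    \s*$
''' % {'pair': PAIR}, re.VERBOSE)


def args_are_valid(inner):
    """Check the text between the parentheses (parentheses already stripped)."""
    return strict_args.match(inner) is not None


for candidate in ['value=3',
                  'value="ord_nbr ASC, name",insertable=false',
                  ',value=3',
                  'value=3,',
                  'value="x"insertable=false']:
    print(candidate, '->', args_are_valid(candidate))
```

The same shape answers the original question of how to make the 2nd to nth assignment depend on the first: the repeated group sits inside the optional group that holds the first pair.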


Edit (afterthought): The syntax for the argument list is actually pretty close to a real data format - you could probably feed the argument list to a JSON or YAML parser and it would tell you if it was good or not. Use the existing wheel (a JSON parser) instead of reinventing it, if you can.

This would allow, amongst other things -

  • Recognition of all argument types that Javascript supports, including floating point numbers and so on
  • Support for escaped quotes inside strings. Right now the regular expression will barf and die on "This is a quote mark: \"." because it thinks the second quote ends the string. (It doesn't.)

This can be done in regex, but it's horrible and complicated.
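For the escaped-quote case alone, here is a sketch of one commonly used idiom: match either a character that is neither a quote nor a backslash, or a backslash followed by any one character. It is shown next to the simpler `"[^"]*"` pattern for contrast. The names `naive`, `escape_aware` and `text` are illustrative, and this covers only string values, not the other value types.

```python
import re

# The simple pattern: stops at the first quote it sees, escaped or not.
naive = re.compile(r'"[^"]*"')

# Escape-aware variant: a backslash always consumes the character after it,
# so an escaped quote cannot end the string.
escape_aware = re.compile(r'''
    "                 # opening quote
    (?:
        [^"\\]        # any character except a quote or a backslash
      | \\.           # or a backslash followed by any one character
    )*
    "                 # closing quote
''', re.VERBOSE)

text = r'value="This is a quote mark: \"."'

naive_match = naive.search(text).group(0)
aware_match = escape_aware.search(text).group(0)

print(naive_match)   # cut off at the escaped quote
print(aware_match)   # the full quoted string
```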

Li-aung Yip
  • The allowed values are pretty simple: `true|false||, MyEnum.WHATEVER`. No need for more actually. – Kawu Apr 09 '12 at 11:51
  • I didn't allow for `MyEnum.WHATEVER` - but it's pretty easy to see where you could add that in `regex_2`. (The joy of `re.VERBOSE`!) – Li-aung Yip Apr 09 '12 at 11:54
  • Yeah I know... I left the enum out for simplicity. – Kawu Apr 09 '12 at 12:01
  • Ah, I see. A closing note: You use `[a-z0-9_]` in your regex, which you could replace with `\w` (equivalent to `[A-Za-z0-9_]` if you're in an English locale.) – Li-aung Yip Apr 09 '12 at 12:08