11

I am trying to parse some docstrings.

An example docstring is:

Test if a column field is larger than a given value
    This function can also be called as an operator using the '>' syntax

    Arguments:
        - DbColumn self
        - string or float value: the value to compare to
            in case of string: lexicographic comparison
            in case of float: numeric comparison
    Returns:
        DbWhere object

Both the Arguments and Returns parts are optional. I want my regex to return as groups the description (first lines), the Arguments part (if present) and the Returns part (if present).

The regex I have now is:

m = re.search('(.*)(Arguments:.*)(Returns:.*)', s, re.DOTALL)

This works when all three parts are present, but fails as soon as the Arguments or the Returns part is missing. I have tried several variations with non-greedy modifiers like ??, but to no avail.

Edit: When the Arguments and Returns parts are present, I actually would only like to match the text after Arguments: and Returns: respectively.

Thanks!

BioGeek

2 Answers

12

Try with:

re.search('^(.*?)(Arguments:.*?)?(Returns:.*)?$', s, re.DOTALL)

This makes the second and third groups optional by appending a ?, and makes the quantifiers of the first two groups non-greedy by (again) appending a ? to them (yes, confusing).

Also, a non-greedy first group on its own would match the shortest possible substring, which for .*? is the empty string. You can overcome this by adding the end-of-string anchor ($) at the end of the pattern. The first group then has to grow until the rest of the pattern can reach the end of the string, i.e. it matches the whole string when there are no Arguments and no Returns sections, and everything before those sections when they are present.
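To see how the optional groups behave when a section is missing, here is a minimal sketch. The helper name `split_docstring` and the shortened sample strings are made up for illustration; only the pattern comes from the answer above.

```python
import re

# Same pattern as above, compiled once. re.DOTALL lets "." match newlines.
PATTERN = re.compile(r'^(.*?)(Arguments:.*?)?(Returns:.*)?$', re.DOTALL)

def split_docstring(doc):
    """Return (description, arguments, returns); absent sections are None."""
    # The pattern can match any string, so search() never returns None here.
    return PATTERN.search(doc).groups()

full = "Desc\nArguments:\n  a\nReturns:\n  b"
only_description = "Only a description"
only_returns = "Desc\nReturns:\n  x"
only_arguments = "Desc\nArguments:\n  a"

for sample in (full, only_description, only_returns, only_arguments):
    print(split_docstring(sample))
```

A section that is absent comes back as None rather than as an empty string, so check for None before using a group.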

Edit: OK, if you just want to capture the text after the Arguments: and Returns: tokens, you'll have to tuck in a couple more groups. We're not going to use all of the groups, so naming them with the (?P<name>...) notation (another question mark, argh!) is starting to make sense:

>>> m = re.search('^(?P<description>.*?)(Arguments:(?P<arguments>.*?))?(Returns:(?P<returns>.*))?$', s, re.DOTALL)
>>> m.groupdict()['description']
"Test if a column field is larger than a given value\n    This function can also be called as an operator using the '>' syntax\n\n    "
>>> m.groupdict()['arguments']
'\n        - DbColumn self\n        - string or float value: the value to compare to\n            in case of string: lexicographic comparison\n            in case of float: numeric comparison\n    '
>>> m.groupdict()['returns']
'\n        DbWhere object'
>>>
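The captured text still carries the surrounding newlines and indentation. One possible way to tidy that up, and to cope with missing sections, is sketched below. The function name `parse_docstring` and the sample string are assumptions for illustration and are not part of the original answer.

```python
import re

# Named-group pattern from above, split over several lines for readability.
DOC_RE = re.compile(
    r'^(?P<description>.*?)'
    r'(Arguments:(?P<arguments>.*?))?'
    r'(Returns:(?P<returns>.*))?$',
    re.DOTALL,
)

def parse_docstring(doc):
    """Map each named section to its stripped text, or None if absent."""
    match = DOC_RE.search(doc)
    return {
        name: text.strip() if text is not None else None
        for name, text in match.groupdict().items()
    }

# A docstring with no Arguments section.
sample = "Compare values\n    Returns:\n        DbWhere object"
print(parse_docstring(sample))
```

Because groupdict() only contains the named groups, the two unnamed wrapper groups never show up in the result.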
Chewie
  • Works like a charm! How would you modify the regex if, for the optional parts, I only want to match the text after `Arguments` and `Returns`? – BioGeek Jan 26 '12 at 14:15
  • Something like `re.search('^(.*?)(Arguments:(.*?))?(Returns:(.*))?$', doc, re.DOTALL)` works, but I don't care for the second and fourth group it returns. – BioGeek Jan 26 '12 at 14:31
  • I've edited my answer. Just name the groups and forget about `groups()`, use `groupdict()` instead. – Chewie Jan 26 '12 at 15:22
5

If you want to match the text after the optional Arguments: and Returns: sections, and you don't want to use (?P<name>...) to name your capture groups, you can also use (?:...), the non-capturing version of regular parentheses.

The regex would look like this:

m = re.search('^(.*?)(?:Arguments:(.*?))?(?:Returns:(.*?))?$', doc, re.DOTALL)
#                     ^^                  ^^

According to the Python 3 documentation:

(?:...)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
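As a rough comparison of the two approaches, the sketch below runs the all-capturing pattern from the comments and the non-capturing pattern on the same input. The short sample string `doc` is made up for illustration.

```python
import re

doc = "Desc\nArguments:\n  a\nReturns:\n  b"

# Every pair of parentheses captures, so the wrapper groups are returned too.
capturing = re.search(
    r'^(.*?)(Arguments:(.*?))?(Returns:(.*))?$', doc, re.DOTALL)

# (?:...) groups without capturing, so only the three wanted groups remain.
non_capturing = re.search(
    r'^(.*?)(?:Arguments:(.*?))?(?:Returns:(.*?))?$', doc, re.DOTALL)

print(len(capturing.groups()))   # 5
print(non_capturing.groups())    # ('Desc\n', '\n  a\n', '\n  b')
```

With the non-capturing version, groups() lines up directly with description, arguments and returns, so no indices need to be skipped.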

PythonJin