
In order to write a custom documentation generator for my Python code, I'd like to write a regular expression capable of matching the following:

def my_function(arg1,arg2,arg3):
    """
    DOC
    """

My current problem is that, using the following regex:

def (.+)\((?:(\w+),)*(\w+)\)

I can only match my_function, arg2 and arg3 (according to Pythex.org).

I don't understand what I'm doing wrong, since my (?:(\w+),)* should match as many arguments as possible, until the last one (here arg3). Could anyone explain?

Thanks

Thomas Kowalski

1 Answer


This isn't possible in a general sense because Python functions are not regular expressions -- they can take on forms that can't be captured with regular expression syntax, especially in the context of OTHER Python structures. But take heart, there is a lot to learn from your question!
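To address the immediate symptom first: in Python's re module, a capturing group inside a repetition only remembers its last match, which is why arg1 vanishes. A minimal demonstration, using the exact pattern from the question, plus one common workaround (capture the whole argument list, then split it in a second pass):

```python
import re

pattern = r'def (.+)\((?:(\w+),)*(\w+)\)'
m = re.match(pattern, "def my_function(arg1,arg2,arg3):")

# The repeated group (?:(\w+),)* matched arg1 and then arg2,
# but a capturing group only keeps its LAST repetition:
print(m.groups())  # ('my_function', 'arg2', 'arg3')

# Workaround: capture the whole parameter list in one group,
# then split it outside the regex.
m2 = re.match(r'def (\w+)\((.*)\)', "def my_function(arg1,arg2,arg3):")
args = [a.strip() for a in m2.group(2).split(',')]
print(args)  # ['arg1', 'arg2', 'arg3']
```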

The fascinating thing is that although you said you're trying to learn regular expressions, you accidentally stumbled into the very heart of computer science itself, compiler theory!

I'm going to address less than a fraction of the tip of the iceberg in this post to help get you started, and then suggest a few free and paid resources to help you continue.

A Python function by itself may take on several forms:

def foo(x):
    "docstring"
    <body>

def foo1(x):
    """doc
    string"""
    <body>

def foo2(x):
    <body>

Additionally, what comes before and after the function may not be another function!

This is what would make using a regex by itself impossible to autogenerate documentation (well, not possible for me. I'm not smart enough to write a single regular expression that can account for the entire Python language!).

What you need to look into is parsing (by the way, I'm using the term parsing very loosely to cover parsing, tokenizing, and lexing, just to keep things "simple"). Regular expressions are typically a very important part of parsing.

The general strategy would be to parse the file into syntactic constructs. Identify which of those constructs are functions. Isolate the text of that function. THEN you can use a regular expression to parse the text of that construct. OR you can parse one level further and break up the function into distinct syntactic constructions -- function name, parameter declaration, doc string, body, etc... at which point your problem will be solved.
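As an aside: for Python specifically, the standard library will do that first phase for you. The sketch below uses the built-in ast module rather than hand-rolled parsing -- a different route from the one developed below, but it shows what the isolated constructs look like once the file has been broken into syntactic units:

```python
import ast

source = '''
def my_function(arg1, arg2, arg3):
    """
    DOC
    """
'''

# ast.parse turns the source into a tree of syntactic constructs;
# ast.walk visits every node so we can pick out the functions.
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        name = node.name
        params = [a.arg for a in node.args.args]
        doc = ast.get_docstring(node)
        print(name, params, doc)  # my_function ['arg1', 'arg2', 'arg3'] DOC
```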

I was attempting to write a regular expression for a standard function definition (without parsing) like foo or foo1, but I was struggling to do so even having written a few languages.

So, just to be clear: the point at which I would think about parsing, as opposed to a simple regex, is any time your input spans multiple lines. Regex is most effective on single lines.

A parsing function looks like this:

def parse_fn_definition(definition):

    def parse_name(lines):
       <code>

    def parse_args(lines):
       <code>

    def parse_doc(lines):
       <code>

    def parse_body(lines):
       <code>
    ...

Now here's the real trick: Each of these parse functions returns two things:

0) The chunk that the regex parsed
1) The REST of the line

so for instance,

import re

def parse_name(lines):
    # Use greedy quantifiers so the whole identifier is captured --
    # a lazy \w*? would stop after a single character
    pattern = r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)'
    line, remaining = lines[0], lines[1:]
    m = re.match(pattern, line)
    if m:
        name = m.group(1)
        rest = line[m.end():]  # whatever follows the name is still unparsed
        return name, [rest] + remaining
    raise Exception("Line cannot be parsed by parse_name: {}".format(line))

So, once you've isolated the function text (that's a whole other set of tricks to do, usually involves creating something called a "grammar" -- don't worry, I set you up with some resources down below), you can parse the function text with the following technique:

def parse_fn(lines_of_text):
    name, rest = parse_name(lines_of_text)
    params, rest = parse_params(rest)
    doc_string, rest = parse_doc(rest)
    body, rest = parse_body(rest)
    function = [name, params, doc_string, body]
    res = function, rest
    return res

This function would return some data structure that represents the function (I just used a simple list for illustration) and the rest of the lines of text. That would get passed on to something that will appropriately catalog your function data and then classify and process the rest of the text!
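To make this concrete, here is a minimal, self-contained sketch of the whole pipeline run on the function from the question. The individual regexes and the docstring handling are my own simplifications (they assume a single-line def and a docstring delimited by bare triple quotes on their own lines), not a general solution:

```python
import re

def parse_name(lines):
    # Consume "def <name>", hand the rest of the line forward
    m = re.match(r'def\s+(\w+)', lines[0])
    return m.group(1), [lines[0][m.end():]] + lines[1:]

def parse_params(lines):
    # Consume "(<params>):", split the parameter list outside the regex
    m = re.match(r'\((.*)\)\s*:', lines[0])
    params = [p.strip() for p in m.group(1).split(',') if p.strip()]
    return params, lines[1:]

def parse_doc(lines):
    # Naive: the docstring is everything between a pair of bare triple quotes
    assert lines[0].strip() == '"""'
    end = next(i for i in range(1, len(lines)) if lines[i].strip() == '"""')
    return '\n'.join(l.strip() for l in lines[1:end]), lines[end + 1:]

def parse_body(lines):
    # Everything left over is the body
    return lines, []

def parse_fn(lines):
    name, rest = parse_name(lines)
    params, rest = parse_params(rest)
    doc, rest = parse_doc(rest)
    body, rest = parse_body(rest)
    return [name, params, doc, body], rest

source = [
    'def my_function(arg1,arg2,arg3):',
    '    """',
    '    DOC',
    '    """',
]
fn, rest = parse_fn(source)
print(fn)  # ['my_function', ['arg1', 'arg2', 'arg3'], 'DOC', []]
```

Each helper consumes the piece it recognizes and hands the rest forward, which is exactly the two-value convention described above.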

Anyway, if this is something that interests you, don't give up! I would offer a few humble suggestions:

1) Start with an EASIER language to parse, like Scheme/LISP. These languages were designed to be easy to parse and manipulate! Then work your way up to more irregular languages.
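To illustrate just how friendly Scheme's syntax is, here is a toy reader in the spirit of Norvig's Lispy (my own simplified sketch, not his exact code) -- tokenizing and tree-building together fit in a dozen lines:

```python
def tokenize(src):
    # Pad parentheses with spaces, then split on whitespace
    return src.replace('(', ' ( ').replace(')', ' ) ').split()

def read_tokens(tokens):
    # Build a nested list (the syntax tree) from the token stream
    token = tokens.pop(0)
    if token == '(':
        expr = []
        while tokens[0] != ')':
            expr.append(read_tokens(tokens))
        tokens.pop(0)  # discard the closing ')'
        return expr
    return token

print(read_tokens(tokenize('(define (square x) (* x x))')))
# ['define', ['square', 'x'], ['*', 'x', 'x']]
```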

2a) Peter Norvig has done some amazing and very accessible work on this. Check out Lispy!

2b) Peter Norvig's class CS212 (specifically unit 3 code) is very challenging but does an excellent job introducing fundamental language design concepts. Every job I've ever gotten, and my love for programming, is because of that course.

3) If you want to advance yourself even further and you can afford it, I would strongly recommend checking out Dave Beazley's workshops on compilers or interpreters. I've taken two courses from Dave, and while I can't promise this for everyone, my salary has literally doubled after each course, so I think it's a worthwhile investment.

4) Absolutely check out Structure and Interpretation of Computer Programs (the wizard book) and Compilers (the dragon book). They'll change your life.

5) DON'T GIVE UP! YOU GOT THIS!! Good luck to you!

  • IIRC, Lisp wasn't designed to be easy to parse; they wanted to make a more traditional syntax, but S-expressions were easier to implement, and they never had a compelling reason to go through with the original plan. – user2357112 Apr 11 '18 at 17:52
  • Interesting, I didn't know that! Lucky for us! – snakes_on_a_keyboard Apr 11 '18 at 18:00
  • @user2357112 Well, McCarthy and Russell _did_ design sexprs to be easy to parse, including the idea that the AST is the representation of the AST. And mexprs were supposed to be easily translatable to sexprs (as in easy to do in your head, not just for the code). The fact that nobody found mexprs any easier to program in than sexprs, so they were eventually abandoned, is the lucky accident. (And it's not even entirely an accident; Russell pushed that idea from the start.) – abarnert Apr 11 '18 at 18:01
  • It seems to be a universal principle that all this work we do -- lambda calculus, category theory, all the way down -- is because, at heart, we're very, very lazy and want to minimize the work/thinking we have to do in the long run :D Love this field – snakes_on_a_keyboard Apr 11 '18 at 18:04
  • @snakes_on_a_keyboard The difference between a great software engineer and a good one is that the great one is lazier. Unfortunately, the difference between a terrible software engineer and a bad one is that the terrible one is lazier. It's all about being lazy in the right ways… – abarnert Apr 11 '18 at 18:14