1

I want to extract all Python functions/methods with their signatures from a Python project. I've tried:

$ grep -r ^def *

but this doesn't show full signatures when parameters span several lines. Any suggestions?
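
For example, a definition like this one (made up for illustration) only shows up partially:

def process(records,
            batch_size=100,
            retries=3):
    ...

grep only prints the first line here, so the rest of the signature is lost.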

planetp
  • 14,248
  • 20
  • 86
  • 160
  • You could write another python code snippet with the `re` module. `import re; print re.findall('def.*?\)', open('file.py').read(), re.M)` – cs95 Jul 04 '16 at 09:04
  • 1
    Possible duplicate of [How can I search for a multiline pattern in a file?](http://stackoverflow.com/questions/152708/how-can-i-search-for-a-multiline-pattern-in-a-file) – Destrif Jul 04 '16 at 09:04
  • find . -iname '*.py' | xargs pcregrep -M '^def .*\n.*\\(.*\\)' – Destrif Jul 04 '16 at 09:05
  • @Shiva It just introduces more edge cases which are not handled. See for example `def f(abc=(), this_is_skipped=0)` and `definition=...` – viraptor Jul 04 '16 at 09:07
  • @Destrif This will skip definitions that use only a single line, or that have newlines between arguments. – viraptor Jul 04 '16 at 09:09
  • Oh yeah, didn't realise that was a possibility. – cs95 Jul 04 '16 at 09:09
  • 1
    Regexes are not enough. That's because default arguments can have arbitrary nesting of parentheses, and that is not a regular language. So a regex solution will *not* be robust no matter what and there will be cases where it fails. I would simply import the module in Python and use the [`inspect`](https://docs.python.org/3/library/inspect.html) module to obtain [the signatures of the definitions](https://docs.python.org/3/library/inspect.html#introspecting-callables-with-the-signature-object). If for some reason you don't want to actually import the module, then `ast` is the solution. – Bakuriu Jul 04 '16 at 09:11
  • Alternatively, write docstrings and let [Sphinx](http://www.sphinx-doc.org/en/stable/) build nicely-formatted documentation for you. – jonrsharpe Jul 04 '16 at 09:15

3 Answers

6

You can tokenize the file and use that to print function definitions:

import token
from tokenize import generate_tokens

def find_definitions(filename):
    with open(filename) as f:
        gen = generate_tokens(f.readline)
        for tok in gen:
            if tok[0] == token.NAME and tok[1] == 'def':
                # function definition, read until next colon.
                definition, last_line = [tok[-1]], tok[3][0]
                while not (tok[0] == token.OP and tok[1] == ':'):
                    if last_line != tok[3][0]:
                        # more than one line, append, track line number
                        definition.append(tok[-1])
                        last_line = tok[3][0]
                    tok = next(gen)
                if last_line != tok[3][0]:
                    definition.append(tok[-1])
                yield ''.join(definition)

This works regardless of how many lines a function definition uses.

Demo:

>>> import textwrap
>>> gen = find_definitions(textwrap.__file__.rstrip('c'))
>>> for definition in gen:
...     print(definition.rstrip())
...
    def __init__(self,
                 width=70,
                 initial_indent="",
                 subsequent_indent="",
                 expand_tabs=True,
                 replace_whitespace=True,
                 fix_sentence_endings=False,
                 break_long_words=True,
                 drop_whitespace=True,
                 break_on_hyphens=True):
    def _munge_whitespace(self, text):
    def _split(self, text):
    def _fix_sentence_endings(self, chunks):
    def _handle_long_word(self, reversed_chunks, cur_line, cur_len, width):
    def _wrap_chunks(self, chunks):
    def wrap(self, text):
    def fill(self, text):
def wrap(text, width=70, **kwargs):
def fill(text, width=70, **kwargs):
def dedent(text):

The above uses the textwrap module to demonstrate how it can handle multi-line definitions.

If you need to support Python 3 code that uses annotations, you have to be a little cleverer and track opening and closing parentheses too; a colon inside the parentheses (an annotation, say) doesn't end the signature. On the other hand, Python 3's tokenize.tokenize() produces named tuples, which make the function below a little easier to read:

import token
from tokenize import tokenize

def find_definitions(filename):
    with open(filename, 'rb') as f:
        gen = tokenize(f.readline)
        for tok in gen:
            if tok.type == token.NAME and tok.string == 'def':
                # function definition, read until next colon outside
                # parentheses.
                definition, last_line = [tok.line], tok.end[0]
                parens = 0
                while tok.exact_type != token.COLON or parens > 0:
                    if last_line != tok.end[0]:
                        definition.append(tok.line)
                        last_line = tok.end[0]
                    if tok.exact_type == token.LPAR:
                        parens += 1
                    elif tok.exact_type == token.RPAR:
                        parens -= 1
                    tok = next(gen)
                if last_line != tok.end[0]:
                    definition.append(tok.line)
                yield ''.join(definition)

In Python 3 you'd preferably open source files in binary mode and let the tokenizer figure out the right encoding. Also, the above Python 3 version can tokenize Python 2 code without issue.
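
If you want to cover a whole project rather than a single file, you can feed every .py file to the generator. A rough sketch, assuming the Python 3 find_definitions() above (the find_project_definitions name is just for illustration):

import os

def find_project_definitions(root):
    # Walk the project tree and run find_definitions() on every .py file,
    # printing each definition together with the path it came from.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.py'):
                continue
            path = os.path.join(dirpath, name)
            for definition in find_definitions(path):
                print('{}: {}'.format(path, definition.rstrip()))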

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
1

You can parse the source using the ast module. It allows you to see exactly the same code structure the interpreter sees. You just need to traverse it and dump out any function definitions you find.

If you want to handle edge cases like multi-line definitions, bash/grep is not enough.
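
A rough sketch of that approach, assuming Python 3 (the dump_definitions name and output format are just for illustration; defaults, *args/**kwargs and annotations are left out for brevity):

import ast

def dump_definitions(filename):
    # Parse the file and walk the tree, printing every def with its
    # positional parameter names and line number. node.lineno (and
    # node.col_offset) let you map back to the source if you want the
    # exact header text.
    with open(filename) as f:
        tree = ast.parse(f.read(), filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ', '.join(a.arg for a in node.args.args)
            print('%s:%d: def %s(%s)' % (filename, node.lineno, node.name, args))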

viraptor
  • 33,322
  • 10
  • 107
  • 191
  • 1
    `ast` is actually the wrong approach as you then have to map back into the source code for the lines. Go one step back in the process, only *tokenize*, as you then still have direct access to the line data. – Martijn Pieters Jul 04 '16 at 09:18
  • 1
    @MartijnPieters I disagree, AST provides `lineno` and `coloffset` on all the nodes (https://docs.python.org/2/library/ast.html#ast.AST.lineno), so the "map back" step is just `node.lineno`. Not that the tokenize solution is bad though :) But AST is not that different. – viraptor Jul 04 '16 at 10:06
  • 1
    Right, but that then requires you to add in the `linecache` module or read all of the source file into memory to get back to the source. And the `ast` module is far less flexible when it comes to handling older Python code; you can tokenize Python 2 code in Python 3, you can't produce an AST for it. – Martijn Pieters Jul 04 '16 at 10:27
1

This isn't a job for regexes in my opinion, unless you accept that you'll potentially miss many edge cases.

Instead I'd suggest you use inspect and funcsigs (funcsigs is a backport of the signature-parsing additions made to Python 3's inspect module).

Here's the file we'll parse (inspect_me.py):

import sys


def my_func(a, b=None):
    print a, b


def another_func(c):
    """
    doc comment
    """
    return c + 1

And here's the code that will parse it for us:

import inspect
from funcsigs import signature

import inspect_me


if __name__ == "__main__":
    # get all the "members" of our module:
    members = inspect.getmembers(inspect_me)
    for k, v in members:
        # we're only interested in functions for now (classes, vars, etc... may come later in a very similar fashion):
        if inspect.isfunction(v):
            # the name of our function:
            print k

            # the function signature as a string
            sig = signature(v)
            print str(sig)

            # let's strip out the doc string too:
            if inspect.getdoc(v):
                print "FOUND DOC COMMENT: %s" % (inspect.getdoc(v))

inspect is the way to go for introspection in Python. token and ast could both do the job, but they're much more low-level and complex than what you actually need here.

The output of running the above:

another_func
(c)
FOUND DOC COMMENT: doc comment
my_func
(a, b=None)
smassey
  • 5,875
  • 24
  • 37
  • This requires that the code is imported, which *can* have side effects and is rather heavy in terms of processing (as all the function and class definitions are loaded). This also requires that the Python code is on `sys.path`; you can't just point this script to a directory and have it print out all the function definitions. – Martijn Pieters Jul 04 '16 at 09:25
  • @MartijnPieters for sure. You could call that a disadvantage, I'd call it an advantage: you can now inspect dynamically created classes that don't physically exist in your file but are created on the fly. Meta fun stuff :) If performance is an issue.. that's not the fault of the parsing: the problem was obviously there long before you ran your inspector :P – smassey Jul 04 '16 at 09:29
  • Well, unless you are using `types.FunctionType` to create function objects from scratch, tokenizing can handle dynamic definitions too. – Martijn Pieters Jul 04 '16 at 09:40
  • I was thinking in terms of future requirements: show classes, obj constructors, module comments, any and everything created via `type()` etc. Tokens are great if you know exactly what you're looking for in advance and don't expect things to change. IMHO they'll become a huge PITA to maintain once you start adding new requirements. Not to mention: custom token parsers will need to change with the changes to the language where some abstracted introspection doesn't. – smassey Jul 04 '16 at 10:09
  • Actually, tokenizing is quite flexible; the tokenizer in Python 3 can handle Python 2 syntax just fine, while you can't *import* Python 2 code into 3. Handling docstrings and classes isn't too hard; just track `INDENT` and `DEDENT` tokens. – Martijn Pieters Jul 04 '16 at 10:25