Extracting comments from Python Source Code

Question

I'm trying to write a program that extracts the comments from code a user enters. I tried to use regex, but found it difficult to write.

Then I found a post here. The answer suggests using tokenize.generate_tokens to analyze the grammar, but the documentation says:

The generate_tokens() generator requires one argument, readline, which must be a callable object which provides the same interface as the readline() method of built-in file objects (see section File Objects).

But a string object does not have a readline method.

Then I found another post here, suggesting StringIO.StringIO as a way to get a readline method. So I wrote the following code:

import tokenize
import io
import StringIO

def extract(code):
    res = []
    comment = None
    stringio = StringIO.StringIO(code)
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
        # print(toknum,tokval)
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            print tokenize.untokenize(toktype)
    return tokenize.untokenize(res)

And entered the following code: extract('a = 1+2#A Comment')

But got:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ext.py", line 10, in extract
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
  File "C:\Python27\lib\tokenize.py", line 294, in generate_tokens
    line = readline()
AttributeError: StringIO instance has no __call__ method

I know I can write a new class, but is there any better solution?

You're onto the right solution. Please show some code so we can help you make it work. — zmo, Dec 29 '15 at 13:11

Dimitris Fasarakis Hilliard · Accepted Answer · 2019-02-07T13:02:25.610

Answer for more general cases (extracting from modules, functions):

Modules:

The documentation specifies that one needs to provide a callable which exposes the same interface as the readline() method of built-in file objects. This hints at the solution: create an object that provides that method.

In the case of a module, we can just open the module as a normal file and pass in its readline method. This is the key: the argument you pass is the readline method itself (not called, and not the file object).

Given a small scrpt.py file with:

# My amazing foo function.
def foo():
    """ docstring """
    # I will print
    print "Hello"
    return 0   # Return the value

# Maaaaaaain
if __name__ == "__main__":
    # this is main
    print "Main"

We will open it as we do all files:

fileObj = open('scrpt.py', 'r')

This file object now has a method called readline (because it is a file object) which we can safely pass to tokenize.generate_tokens and create a generator.

tokenize.generate_tokens yields one 5-tuple per token, holding the token type, the token string, the start and end positions, and the source line (in Python 3 these are named tuples). Note: in Python 3, tokenize.tokenize is the bytes-based counterpart; it requires that readline return bytes, so you'll need to open the file in 'rb' mode to use it. Here's a small demo:

for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    # we can also use token.tok_name[toktype] instead of 'COMMENT'
    # from the token module 
    if toktype == tokenize.COMMENT:
        print 'COMMENT' + " " + tok

Notice how we pass the fileObj.readline method to it. This will now print:

COMMENT # My amazing foo function.
COMMENT # I will print
COMMENT # Return the value
COMMENT # Maaaaaaain
COMMENT # this is main

So all comments are detected, regardless of position. Docstrings are excluded, of course, because the tokenizer reports them as STRING tokens rather than COMMENT tokens.
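For readers on Python 3, the module case above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name comments_in_file and the throwaway demo file are hypothetical, and the file is opened in 'rb' mode because tokenize.tokenize expects readline to return bytes.

```python
import os
import tempfile
import tokenize


def comments_in_file(path):
    """Return the text of every comment token in the file at `path`."""
    # tokenize.tokenize wants a readline that returns bytes, hence 'rb'.
    with open(path, "rb") as file_obj:
        return [
            tok.string
            for tok in tokenize.tokenize(file_obj.readline)
            if tok.type == tokenize.COMMENT
        ]


# Demo on a throwaway file (hypothetical content, Python 3 syntax).
SOURCE = (
    "# My amazing foo function.\n"
    "def foo():\n"
    '    """ docstring """\n'
    "    # I will print\n"
    '    print("Hello")\n'
    "    return 0   # Return the value\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write(SOURCE)

try:
    found = comments_in_file(tmp.name)
finally:
    os.remove(tmp.name)

for comment in found:
    print("COMMENT", comment)
```

The docstring does not appear in the result, since it is tokenized as a string literal.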

Functions:

You could achieve a similar result without open, although I really can't think of many cases that need it. Nonetheless, I'll present another way of doing it for completeness' sake. In this scenario you'll need two additional modules, inspect and StringIO (io.StringIO in Python 3):

Let's say you have the following function:

def bar():
    # I am bar
    print "I really am bar"
    # bar bar bar baaaar
    # (bar)
    return "Bar"

You need a file-like object which has a readline method to use it with tokenize. Well, you can create a file-like object from a str using StringIO.StringIO, and you can get a str representing the source of the function with inspect.getsource(func). In code:

import inspect
import StringIO

funcText = inspect.getsource(bar)
funcFile = StringIO.StringIO(funcText)

Now we have a file-like object representing the function, and it has the readline method we want. We can just re-use the loop from before, replacing fileObj.readline with funcFile.readline. The output we get now is of a similar nature:

COMMENT # I am bar
COMMENT # bar bar bar baaaar
COMMENT # (bar)
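A rough Python 3 equivalent of the function case might look like the sketch below. The helper names comments_in_source and comments_in_function are hypothetical, and the sketch assumes a top-level function whose source file is available to inspect.getsource. It uses tokenize.generate_tokens, which in Python 3 accepts a readline that returns str, so io.StringIO can be used directly.

```python
import inspect
import io
import tokenize


def comments_in_source(text):
    """Return the comment tokens found in a string of Python source."""
    # io.StringIO gives the string a readline method; generate_tokens
    # accepts a readline that returns str, so no encoding step is needed.
    readline = io.StringIO(text).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]


def comments_in_function(func):
    """Return the comments inside `func` (its source file must be available)."""
    return comments_in_source(inspect.getsource(func))


# The source of bar() as a plain string, so the demo does not depend on
# inspect being able to locate a file.
BAR_SOURCE = (
    "def bar():\n"
    "    # I am bar\n"
    '    print("I really am bar")\n'
    "    # bar bar bar baaaar\n"
    "    # (bar)\n"
    '    return "Bar"\n'
)
print(comments_in_source(BAR_SOURCE))
```

For a function defined in a real module, comments_in_function(bar) would take the place of the string-based call.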

As an aside, if you really want to create a custom way of doing this with re, take a look at the source of the tokenize.py module. It defines patterns for comments (r'#[^\r\n]*'), names, et cetera, loops through the lines with readline, and searches within each line for those patterns. Thankfully, it's not too complex after you look at it for a while :-).
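To illustrate why the comment pattern alone is not enough, here is a small Python 3 sketch (the sample line is made up for the demonstration). Applied on its own, the pattern starts matching at the first '#', even when that character sits inside a string literal; the tokenizer avoids this because it has already consumed the string as a single token.

```python
import io
import re
import tokenize

LINE = 'url = "http://example.com/#frag"  # real comment\n'

# The bare pattern starts at the first '#', which sits inside the string.
regex_hits = re.findall(r"#[^\r\n]*", LINE)

# The tokenizer knows the first '#' belongs to a STRING token.
token_hits = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(LINE).readline)
    if tok.type == tokenize.COMMENT
]

print(regex_hits)
print(token_hits)
```

A hand-rolled regex solution therefore has to match string literals as well, which is essentially what tokenize.py does.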

Answer for function `extract` (Update):

You've created an object with StringIO that provides the interface, but you haven't passed that interface (readline) to tokenize.generate_tokens; instead, you passed the full object (stringio).

Additionally, in your else clause a TypeError is going to be raised because untokenize expects an iterable of tokens as input, and you pass it the bare toktype integer. Making the following changes, your function works fine:

def extract(code):
    res = []
    comment = None
    stringio = StringIO.StringIO(code)
    # pass in stringio.readline to generate_tokens
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio.readline):
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            # wrap the (toktype, tokval) tuple in a list
            print tokenize.untokenize([(toktype, tokval)])
    return tokenize.untokenize(res)

Supplied with input of the form expr = extract('a=1+2#A comment') the function will print out the comment and retain the expression in expr:

expr = extract('a=1+2#A comment')
#A comment

print expr
'a =1 +2 '

Furthermore, as mentioned in the Functions section, io houses StringIO in Python 3, so in that case the separate StringIO import is thankfully not required.
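As a hedged Python 3 sketch of the same fix (not the code from the question, and returning the comments instead of printing them, which is a design choice made here for illustration), extract could look like this:

```python
import io
import tokenize


def extract(code):
    """Split `code` into (code without comments, list of comments)."""
    kept = []
    comments = []
    # Pass the bound readline method, not the StringIO object itself.
    readline = io.StringIO(code).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string)
        else:
            kept.append((tok.type, tok.string))
    # With 2-tuples, untokenize guarantees only that the result
    # tokenizes back to the same tokens; spacing may differ from the input.
    return tokenize.untokenize(kept), comments


stripped, comments = extract("a = 1+2#A Comment")
print(comments)
print(repr(stripped))
```

The returned code is still valid Python, but its whitespace is normalized by untokenize, as in the 'a =1 +2 ' output shown above.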

(python 3.6)Passing the fileObject.readline directly to tokenize.tokenize will only work if it is opened in mode 'rb' since I guess tokenize expects bytestring from readline(). — philoj, Feb 07 '19 at 12:49
@philoj it does. Apparently in all Python > 3 versions. Good catch! — Dimitris Fasarakis Hilliard, Feb 07 '19 at 13:01

score 1 · Answer 2 · answered Jun 24 '21 at 08:45

Use this Third-Party Library from PyPI

Comment Parser

Shedrack

