
How would I write a regex that removes all comments that start with `#` and stop at the end of the line, while excluding the first two lines, which say

#!/usr/bin/python 

and

#-*- coding: utf-8 -*-
captainandcoke
    Comments don't slow your code down. Why do you want to remove them? – agf Aug 11 '11 at 20:03
  • You don't :). At least, not with a simple regex. Consider the following: `s = 'not # a # comment!'`, or this: `s = """ \n foo # \n bar """` (where `\n` are actual line breaks) – Bart Kiers Aug 11 '11 at 20:06
  • @agf, to make things more difficult for the next person to work on the code! – bgw Aug 11 '11 at 20:06
  • 2
    This question is similar to http://stackoverflow.com/q/1621521 , where there is already a (not entirely regex) solution that may satisfy your needs – bgw Aug 11 '11 at 20:13

3 Answers


You can remove comments by parsing the Python code with tokenize.generate_tokens. The following is a slightly modified version of this example from the docs:

import io
import sys
import tokenize

# Python 2's tokenize expects byte strings, so fall back to BytesIO there.
if sys.version_info[0] == 3:
    StringIO = io.StringIO
else:
    StringIO = io.BytesIO

def nocomment(s):
    result = []
    g = tokenize.generate_tokens(StringIO(s).readline)
    for toknum, tokval, _, _, _ in g:
        if toknum != tokenize.COMMENT:
            result.append((toknum, tokval))
    return tokenize.untokenize(result)

with open('script.py', 'r') as f:
    content = f.read()

print(nocomment(content))

For example:

If script.py contains

def foo(): # Remove this comment
    ''' But do not remove this #1 docstring 
    '''
    # Another comment
    pass

then the output of nocomment is

def foo ():
    ''' But do not remove this #1 docstring 
    '''

    pass 
unutbu
  • I'm just curious: How well does this handle stuff like extra whitespace? – bgw Aug 12 '11 at 07:41
  • @PiPeep: For an example of how tokenize can handle whitespace, see [reindent.py](http://svn.python.org/projects/python/trunk/Tools/scripts/reindent.py). – unutbu Aug 12 '11 at 09:35
  • I think your code needs updating; it now gives an error in the library itself: `File "/usr/lib/python3.6/tokenize.py", line 565, in _tokenize if line[pos] in '#\r\n': # skip comments or blank lines TypeError: 'in <string>' requires string as left operand, not int` – Nihal Feb 11 '19 at 13:20
  • Your code works in Python 2.7, but not in Python 3.6 – Nihal Feb 11 '19 at 13:23
  • Updated for Python3. – unutbu Feb 11 '19 at 14:02
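As the comment thread notes, the pre-update (Python-2-only) version raised a TypeError under Python 3.6 because `tokenize` was handed bytes. As a sanity check, here is a trimmed Python-3-only variant of the same approach (the name `strip_comments` is mine, not from the answer):

```python
import io
import tokenize

def strip_comments(source):
    """Drop COMMENT tokens; string literals and docstrings are untouched."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [(tok.type, tok.string) for tok in tokens
            if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

print(strip_comments("x = 1  # a comment\n"))
```

Note that `untokenize` in this two-tuple "compatibility mode" may change spacing slightly, but the code stays semantically identical.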
I don't actually think this can be done purely with a regular expression, as you'd need to track quoting to ensure that an instance of `#` isn't inside a string.

I'd look into Python's built-in code-parsing modules for help with something like this.
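A quick sketch of the failure mode (the snippet is hypothetical):

```python
import re

src = 's = "not # a comment"  # a real comment\n'
# A naive pattern cannot tell the two apart and eats from the FIRST '#',
# leaving the string literal unterminated:
naive = re.sub(r'#.*', '', src)
print(naive)  # prints: s = "not
```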

bgw
sed -e '1,2b' -e '/^\s*#/d' infile

Then wrap this in a subprocess.Popen call.

However, this is no substitute for a real parser! Why does that matter? Well, assume this Python script:

output = """
This is
#1 of 100"""

Boom, any non-parsing solution instantly breaks your script.
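The breakage is easy to reproduce; here with a hypothetical `re`-based line stripper (sed's `/^\s*#/d` behaves the same way):

```python
import re

script = 'output = """\nThis is\n#1 of 100"""\n'
# Deleting whole-line "comments" also deletes the last line of the
# triple-quoted string, leaving the literal unterminated:
broken = re.sub(r'(?m)^\s*#.*\n?', '', script)
print(broken)
```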

Boldewyn
  • Why not just use the python `re` package in the example, rather than requiring a platform-dependent tool? – bgw Aug 11 '11 at 20:22