
How would I write a regex that removes all comments that start with `#` and stop at the end of the line, while excluding the first two lines, which say

#!/usr/bin/python 

and

#-*- coding: utf-8 -*-
captainandcoke
    Comments don't slow your code down. Why do you want to remove them? – agf Aug 11 '11 at 20:03
  • You don't :). At least, not with a simple regex. Consider the following: `s = 'not # a # comment!'`, or this: `s = """ \n foo # \n bar """` (where `\n` are actual line breaks) – Bart Kiers Aug 11 '11 at 20:06
  • @agf, to make things more difficult for the next person to work on the code! – bgw Aug 11 '11 at 20:06
  • 2
    This question is similar to http://stackoverflow.com/q/1621521 , where there is already a (not entirely regex) solution that may satisfy your needs – bgw Aug 11 '11 at 20:13

3 Answers


You can remove comments by parsing the Python code with tokenize.generate_tokens. The following is a slightly modified version of this example from the docs:

import io
import sys
import tokenize

# Python 2's tokenize expects byte strings, so fall back to BytesIO there.
if sys.version_info[0] == 3:
    StringIO = io.StringIO
else:
    StringIO = io.BytesIO

def nocomment(s):
    result = []
    g = tokenize.generate_tokens(StringIO(s).readline)
    for toknum, tokval, _, _, _ in g:
        if toknum != tokenize.COMMENT:
            result.append((toknum, tokval))
    return tokenize.untokenize(result)

with open('script.py', 'r') as f:
    content = f.read()

print(nocomment(content))

For example:

If script.py contains

def foo(): # Remove this comment
    ''' But do not remove this #1 docstring 
    '''
    # Another comment
    pass

then the output of nocomment is

def foo ():
    ''' But do not remove this #1 docstring 
    '''

    pass 
unutbu
  • I'm just curious: How well does this handle stuff like extra whitespace? – bgw Aug 12 '11 at 07:41
  • @PiPeep: For an example of how tokenize can handle whitespace, see [reindent.py](http://svn.python.org/projects/python/trunk/Tools/scripts/reindent.py). – unutbu Aug 12 '11 at 09:35
  • I think your code needs updating; it now gives an error in the library itself: `File "/usr/lib/python3.6/tokenize.py", line 565, in _tokenize if line[pos] in '#\r\n': # skip comments or blank lines TypeError: 'in <string>' requires string as left operand, not int` – Nihal Feb 11 '19 at 13:20
  • Your code works in Python 2.7, but not in Python 3.6 – Nihal Feb 11 '19 at 13:23
  • Updated for Python3. – unutbu Feb 11 '19 at 14:02
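As the comment thread notes, the pre-update (Python-2-only) version raised a TypeError under Python 3.6 because `tokenize` was handed bytes. As a sanity check, here is a trimmed Python-3-only variant of the same approach (the name `strip_comments` is mine, not from the answer):

```python
import io
import tokenize

def strip_comments(source):
    """Drop COMMENT tokens; string literals and docstrings are untouched."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [(tok.type, tok.string) for tok in tokens
            if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

print(strip_comments("x = 1  # a comment\n"))
```

Note that `untokenize` in this two-tuple "compatibility mode" may change spacing slightly, but the code stays semantically identical.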
I don't actually think this can be done purely with a regular expression, as you'd need to track quoting to ensure that an instance of `#` isn't inside a string.

I'd look into Python's built-in code-parsing modules for help with something like this.
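A quick sketch of the failure mode (the snippet is hypothetical):

```python
import re

src = 's = "not # a comment"  # a real comment\n'
# A naive pattern cannot tell the two apart and eats from the FIRST '#',
# leaving the string literal unterminated:
naive = re.sub(r'#.*', '', src)
print(naive)  # prints: s = "not
```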

bgw
sed -e '1,2b' -e '/^\s*#/d' infile

Then wrap this in a subprocess.Popen call.

However, this is no substitute for a real parser! Why does that matter? Well, assume this Python script:

output = """
This is
#1 of 100"""

Boom, any non-parsing solution instantly breaks your script.
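The breakage is easy to reproduce; here with a hypothetical `re`-based line stripper (sed's `/^\s*#/d` behaves the same way):

```python
import re

script = 'output = """\nThis is\n#1 of 100"""\n'
# Deleting whole-line "comments" also deletes the last line of the
# triple-quoted string, leaving the literal unterminated:
broken = re.sub(r'(?m)^\s*#.*\n?', '', script)
print(broken)
```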

Boldewyn
  • Why not just use the python `re` package in the example, rather than requiring a platform-dependent tool? – bgw Aug 11 '11 at 20:22