Removing string literals that really are comments from python source code?

Question

I need to quickly write (or borrow) something, in any language, that automatically filters tons of Python source code in order to remove comments. The goal is to make the code on the target platform more compact (and, as an aside, to make reverse engineering ever so slightly more difficult). I must positively not modify the code's behavior, and can live with a few leftover comments. My input and output should be a .py text file, assumed to be valid Python 2.x (assume: restricted to ASCII, I'll take care of UTF-8).

Strictly speaking, I do not need to remove comments of the kind defined by

A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line.

because the Python tokenizer already does that for me, and in the end the code is distributed as .pyc. Too bad, because I clearly see how to do that cleanly (the only slightly tricky part is the convoluted syntax of string literals in Python).

My problem is that a cursory look at the Python source code I have to filter shows that it includes loads of comments that are not introduced by #, but simply are string literals that perform no useful task. These are definitely kept in the .pyc tokenized file. They are all over the place, I'm told to facilitate automatic generation of documentation, and editing. Many of these string literals that really are comments are embedded in function definitions, like:

def OnForceStatusChoice(self,event):
    """Action when a status is selected"""
    self.ExecutionPanel.SetFocus()

On the other hand there are loads of string literals that are useful text, including English text to be displayed to the user, and initialization of tables. That makes it hopeless to automatically and safely recognize those string literals that really are comments from the value of the string literal.

From my sampling, most string literals that really are comments seem to be introduced by """ (with few enough exceptions that I perhaps could live with that), but I understand enough python to know that I can't safely remove all these string literals.

Can I safely (or with some stated and reasonable assumption on coding style) assume that

1. If the first thing in a .py file, ignoring # comments, is a string literal, it can be removed, recursively? If yes, can this rule be made more powerful by ignoring (and keeping) other things besides # comments?
2. Any string literal starting on the leftmost column of any line can be removed?
3. Any string literal starting after something syntactically matching a function definition (like the above def) can be removed? If yes, how do I precisely define syntactically matching a function definition?

Please answer like I can't tell python from a random collection of bytes, which is not far from reality.

Your premise is flawed: this isn't really going to contribute to compactness or obfuscation. Also, code obfuscators are totally a thing - you must have performed zero research. — Marcin, May 16 '14 at 17:27
@Marcin: The question explains that I know removing `#` comments from `.py` files does nothing, and I want to remove _other_ comments, which I'm told are (or at least include) _docstrings_. Removing (or emptying) docstrings definitely contributes to compactness (and, slightly, to obfuscation), and I aim at doing that. — fgrieu, May 16 '14 at 18:00
I'm sorry, but you're still achieving next to nothing, and did no research. — Marcin, May 16 '14 at 18:08

score 6 · Accepted Answer · edited May 23 '17 at 12:14

What you are calling comments are really docstrings:

A string literal appearing as the first statement in the function body is transformed into the function’s __doc__ attribute and therefore the function’s docstring.

and from the glossary:

A string literal which appears as the first expression in a class, function or module. While ignored when the suite is executed, it is recognized by the compiler and put into the __doc__ attribute of the enclosing class, function or module. Since it is available via introspection, it is the canonical place for documentation of the object.
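
To make the two quoted definitions concrete, here is a minimal sketch. The function names are hypothetical, adapted from the snippet in the question. Only a string literal that is the first statement of the body ends up in `__doc__`; a bare string anywhere else in the body does not.

```python
def on_force_status_choice(event):
    """Action when a status is selected"""
    return event


def with_stray_string(event):
    result = event
    "Not the first statement, so this is not a docstring"
    return result


# The first function has a docstring; the second one does not.
print(on_force_status_choice.__doc__)
print(with_stray_string.__doc__)
```

When run without `-OO`, the first `print` shows the docstring text and the second shows `None`.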

Compile the project to .pyo files by using the -OO command line switch:

-O
Turn on basic optimizations. This changes the filename extension for compiled (bytecode) files from .pyc to .pyo. See also PYTHONOPTIMIZE.

-OO
Discard docstrings in addition to the -O optimizations.

You can compile all files in your project using the compileall module as a command line utility:

python -OO -m compileall path/to/project/
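
One way to check what `-OO` does before changing the build is to run the same tiny program in a child interpreter with and without the switch. This is only a sketch: `SNIPPET` and `doc_under` are made-up names, and it assumes `sys.executable` points at the interpreter you want to test.

```python
import subprocess
import sys

# A tiny program whose only job is to print a function's docstring.
SNIPPET = (
    "def f():\n"
    "    'Action when a status is selected'\n"
    "print(f.__doc__)\n"
)


def doc_under(*flags):
    # Run SNIPPET in a child interpreter started with the given flags
    # and return whatever it printed.
    cmd = [sys.executable] + list(flags) + ["-c", SNIPPET]
    return subprocess.check_output(cmd).decode("ascii").strip()


print(doc_under())       # docstring is kept
print(doc_under("-OO"))  # docstring is discarded, so __doc__ is None
```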

However, Python bytecode is trivial to decompile. Removing the docstrings is not going to buy you much.

If you need something more specialized still, you'll have to learn how to use the ast module to parse Python code into a parse tree, manipulate that tree (e.g. remove all docstrings), then write out transformed Python code. See Parse a .py file, read the AST, modify it, then write back the modified source code for some pointers in that direction.
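
A rough sketch of that parse, modify, write-out approach follows. It rests on an assumption that does not hold for the Python 2.x code in the question: it uses `ast.unparse`, which only exists in Python 3.9 and later, so on Python 2 the write-out step needs a third-party unparser instead. `strip_docstrings` is a hypothetical helper name. Note that the regenerated source also loses `#` comments and the original formatting.

```python
import ast


def strip_docstrings(source):
    # Parse the source, drop every module, class and function docstring,
    # and return regenerated source code (requires Python 3.9+).
    tree = ast.parse(source)
    holders = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, holders):
            continue
        if ast.get_docstring(node, clean=False) is None:
            continue
        # A docstring is always the first statement of the body.
        node.body = node.body[1:]
        if not node.body and not isinstance(node, ast.Module):
            # A def or class body must contain at least one statement.
            node.body = [ast.Pass()]
    return ast.unparse(tree)


SAMPLE = '''
def OnForceStatusChoice(self, event):
    """Action when a status is selected"""
    self.ExecutionPanel.SetFocus()

MESSAGE = """Text shown to the user"""
'''

print(strip_docstrings(SAMPLE))
```

The string assigned to `MESSAGE` survives because it is part of an assignment, not a statement on its own at the top of a body. This sketch removes docstrings outright, so `__doc__` becomes `None`, just as with `-OO`, while `assert` statements are left alone.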

answered May 16 '14 at 17:38

Martijn Pieters

Thanks; some issues: A) I want to remove docstrings, but I'd rather keep assertions, as these can prevent disasters. B) There is reportedly [that issue](http://stackoverflow.com/questions/4777113/what-does-python-optimization-o-or-pythonoptimize-do#comment5290961_4777156), so I really want to empty docstrings, which I do not know how to do, see my 3. C) I do not know if those string literals at the beginning of `.py` files are docstrings (but I can try that). D) Using `.pyo` instead of `.pyc` may require changes to the distribution/installation procedure, I need to have this checked. – fgrieu May 16 '14 at 17:56

@fgrieu: `.pyo` files are really just `.pyc` files but use a new extension to reflect that they no longer hold assert statements and have `__debug__` set to `False`; you can rename them if you so wish. – Martijn Pieters May 16 '14 at 17:58

@fgrieu: the strings at the top of `.py` files are docstrings. A `.py` file is a module. – Martijn Pieters May 16 '14 at 17:59

@fgrieu: Your *test suite* should be preventing the disasters. Asserts are nice for developers, but test suites are better. – Martijn Pieters May 16 '14 at 17:59
@fgrieu: Any code that actually uses assertions to alter behaviour should be printed out on paper and fed to the developer that wrote that. – Martijn Pieters May 16 '14 at 18:01

@fgrieu: I gave you the other option that you have; parse, modify parse tree, write out modified code. – Martijn Pieters May 16 '14 at 18:07

Many thanks for the very relevant extra info. Yes, in an ideal world, the test suite should catch all possible errors (at least, there is a test suite in that project!); but practice shows that the environment of the code will change, and asserts will help. A good compromise would be some `really_assert` that is never stripped. – fgrieu May 16 '14 at 18:08

@fgrieu: `if some_constraint_not_met: raise AssertionError('Error message')`. Use that instead of `assert` statements and it'll never get stripped. – Martijn Pieters May 16 '14 at 18:10
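
The pattern from the last comment can be wrapped in a helper and checked under `-O`. This is a sketch only: `really_assert` is the name floated in the comments, not an existing function, and `CHECKS` and `run_under_O` are made up for the demonstration. It assumes `sys.executable` is the interpreter of interest.

```python
import subprocess
import sys

# Child program: a plain assert followed by an explicit raise.
CHECKS = (
    "def really_assert(condition, message='constraint not met'):\n"
    "    if not condition:\n"
    "        raise AssertionError(message)\n"
    "assert False, 'plain assert fired'\n"
    "print('plain assert was stripped')\n"
    "really_assert(False, 'really_assert fired')\n"
)


def run_under_O():
    # Run CHECKS with -O and return (exit code, stdout text, stderr text).
    proc = subprocess.Popen(
        [sys.executable, "-O", "-c", CHECKS],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    return proc.returncode, out.decode(), err.decode()


code, out, err = run_under_O()
print(code)
print(out)
print(err)
```

Under `-O` the plain `assert` is skipped, so the child gets as far as the `print` call, and then the explicit `raise` inside `really_assert` still stops it with a traceback and a non-zero exit code.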
