I need to quickly write (or borrow) something, in any language, that automatically filters tons of Python source code in order to remove comments. The goal is to make the code on the target platform more compact (and, as an aside, to make reverse engineering ever so slightly more difficult). I must absolutely not modify the code's behavior, and can live with a few leftover comments. My input and output should be a .py text file, assumed to be valid Python 2.x (assume: restricted to ASCII; I'll take care of UTF-8).
Strictly speaking, I do not need to remove comments of the kind defined by "A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line", because the Python tokenizer already does that for me, and in the end the code is distributed as .pyc. Too bad, because I clearly see how to do that cleanly (the only slightly tricky part is the convoluted syntax of string literals in Python).
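For reference, a minimal sketch of how # comments could be located with the standard library tokenize module instead of parsing string literals by hand. It is written for Python 3 (an assumption; the code I must filter is Python 2.x), and hash_comments and sample are names made up for this illustration.

```python
import io
import tokenize


def hash_comments(source):
    """Return (line, column, text) for every # comment the tokenizer reports."""
    reader = io.StringIO(source).readline
    return [
        (tok.start[0], tok.start[1], tok.string)
        for tok in tokenize.generate_tokens(reader)
        if tok.type == tokenize.COMMENT
    ]


# Hypothetical input: the first '#' is inside a string literal, the second is not.
sample = 'x = "# not a comment"  # a real comment\n'
found = hash_comments(sample)
```

Only the second # is reported, since the first one is part of a string literal.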
My problem is, a cursory look at the Python source code I have to filter shows that it includes loads of comments that are not introduced by #, but simply are string literals that perform no useful task. These are definitely kept in the .pyc tokenized file. They are all over the place, I'm told to facilitate automatic generation of documentation, and editing. Many of these string literals that really are comments are embedded in function definitions, like:
def OnForceStatusChoice(self, event):
    """Action when a status is selected"""
    self.ExecutionPanel.SetFocus()
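A small check of that claim, as far as I understand it: the sketch below compiles a snippet modeled on the function above and looks at what survives compilation. The added # comment and the names source, namespace, func, doc and consts are made up for this illustration.

```python
# Source modeled on the example above, plus one hash comment for comparison.
source = '''
def OnForceStatusChoice(self, event):
    """Action when a status is selected"""
    # this hash comment is dropped before compilation
    return None
'''

namespace = {}
exec(compile(source, "<example>", "exec"), namespace)
func = namespace["OnForceStatusChoice"]

# The string literal is stored on the compiled function as its docstring.
doc = func.__doc__

# Constants kept in the compiled code object; the hash comment is not among them.
consts = func.__code__.co_consts
```

Here doc holds the text "Action when a status is selected", while the text of the # comment appears nowhere in consts.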
On the other hand there are loads of string literals that are useful text, including English text to be displayed to the user, and initialization of tables. That makes it hopeless to automatically and safely recognize those string literals that really are comments from the value of the string literal.
From my sampling, most string literals that really are comments seem to be introduced by """ (with few enough exceptions that I perhaps could live with that), but I understand enough Python to know that I can't safely remove all these string literals.
Can I safely (or with some stated and reasonable assumption on coding style) assume that
- If the first thing in a .py file, ignoring # comments, is a string literal, it can be removed, recursively? If yes, can this rule be made more powerful by ignoring (and keeping) other things besides # comments?
- Any string literal starting on the leftmost column of any line can be removed?
- Any string literal starting after something syntactically matching a function definition (like the above def) can be removed? If yes, how do I precisely define syntactically matching a function definition?
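To make the first and third rules concrete, here is one way they might be stated in terms of the parse tree instead of raw text: a string literal is a candidate only when it is the first statement of a module, function, or class body. This is a sketch under assumptions, not a vetted tool. It uses the standard ast module, runs on Python 3.8 or later (the code to filter is Python 2.x, so the tool itself would have to run there and use the older node types), and find_docstrings and sample are hypothetical names.

```python
import ast


def find_docstrings(source):
    """Return (line, text) for each string literal in docstring position."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # Only these bodies are considered here (an assumption of this sketch).
        if not isinstance(node, (ast.Module, ast.FunctionDef, ast.ClassDef)):
            continue
        if not node.body:
            continue
        first = node.body[0]
        # A bare expression statement whose value is a plain string constant.
        if (isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            found.append((first.lineno, first.value.value))
    return sorted(found)


# Hypothetical input: two comment-like strings and one useful string.
sample = '''"""Module comment"""
TABLE = ["useful text"]
def OnForceStatusChoice(self, event):
    """Action when a status is selected"""
    self.ExecutionPanel.SetFocus()
'''
candidates = find_docstrings(sample)
```

On this sample the string assigned into TABLE is not reported, because it is part of an assignment and not a statement on its own. Two caveats I am aware of: if the string is the only statement in a body, deleting it leaves the body empty, so it would have to be replaced by pass; and any code that reads __doc__ at run time would behave differently.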
Please answer like I can't tell python from a random collection of bytes, which is not far from reality.