I'm working with a Python source code corpus. I would like the strings to be replaced with STRING. Python strings are annoying because they allow so many delimiters. Here is what I've tried and the issues I've run into.
r'"(\\"|[^"])*"'
and
r"'(\\'|[^'])*'"
These don't work when a string contains the opposite delimiter.
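To make the failure concrete, here is a small reproduction. The two sample lines are made up for illustration; only the patterns come from the attempt above.

```python
import re

# The two single-delimiter patterns from above.
DOUBLE = re.compile(r'"(\\"|[^"])*"')
SINGLE = re.compile(r"'(\\'|[^'])*'")

# Hypothetical sample lines, each mixing both quote styles.
line_a = '''print("Couldn't do that", 'ok')'''
line_b = '''print('say "hi"', "it's")'''

# The apostrophe inside the double-quoted string is taken as an opening quote.
broken_a = SINGLE.sub("STRING", line_a)
# The double quotes inside the single-quoted string are replaced on their own.
broken_b = DOUBLE.sub("STRING", line_b)

print(broken_a)
print(broken_b)
```

Neither pattern knows that it is already inside a string of the other kind, so running them one after the other in either order can still corrupt a line.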
r'(\'|"|\'\'\'|""")(?:\\\1|(?!\1))*\1'
This was my attempt at a catch-all, but the lookahead doesn't work. I basically wanted
r'(\'|"|\'\'\'|""")(?:\\\1|[^\1])*\1'
if that were possible.
- Multiline strings mess things up. You can't use [^"""] because """ is not one character.
- Strings that contain the other delimiter, like "'".
- Strings that escape the delimiter, like '\''.
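To illustrate why the catch-all attempt falls short: as far as I can tell, Python's `re` reads `\1` inside a character class as an octal escape (the character `\x01`), not as a backreference, and a bare `(?!\1)` consumes no characters. The sketch below checks the first point and then tries one possible repair, a lookahead followed by a consuming `.`. The name `CANDIDATE`, the reordered alternation (triple quotes first), and the use of `re.DOTALL` are my own assumptions, not a known-good solution.

```python
import re

# Inside a character class, \1 is an octal escape for chr(1),
# so [^\1] means "any character except \x01", not "anything but group 1".
class_matches_quote = re.fullmatch(r'[^\1]', '"') is not None
class_matches_x01 = re.fullmatch(r'[^\1]', '\x01') is not None

# Candidate repair (an assumption, not a verified answer):
#   - triple quotes come first so ''' is not read as ' followed by ''
#   - \\. skips any escaped character, including an escaped delimiter
#   - (?!\1). consumes one character that does not start the closing delimiter
# DOTALL lets triple-quoted strings span lines; it also (incorrectly) lets
# single-quoted ones do so.
CANDIDATE = re.compile(r'(\'\'\'|"""|\'|")(?:\\.|(?!\1).)*\1', re.DOTALL)

# Sample strings, one per line, delimiters included.
SAMPLES = [s for s in r'''
'/$\'"`'
'\\'
'^__[\'\\"]([^\'\\"]*)[\'\\"]'
"Couldn't do that"
'''.split("\n") if s]

all_samples_match = all(CANDIDATE.fullmatch(s) for s in SAMPLES)
print(class_matches_quote, class_matches_x01, all_samples_match)
```

Even if this handles the delimiters, it still knows nothing about comments: an apostrophe inside a `#` comment is treated as the start of a string, so a regex alone may not be enough.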
These are the kinds of strings that need to be matched. In each example, the entire line is one string, delimiters included.
'/$\'"`'
'\\'
'^__[\'\\"]([^\'\\"]*)[\'\\"]'
"Couldn't do that"
These are all valid strings, but you can probably see where it might be hard to match them. Essentially, I want this:
def hello_world():
    print("'blah' \"blah\"")
To become:
def hello_world():
    print( STRING )
For simplicity's sake, let's say the entire Python file has been read into a single string. Right now I am reading the file line by line, but I could treat it as one string if necessary. It really doesn't matter how the file is read; if your solution requires a specific method, I will use that. I am not sure this problem can be solved entirely with regex. If you have a solution that involves other code, that would be much appreciated as well.
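One non-regex direction that may be worth considering is the standard library's `tokenize` module, which already knows Python's string syntax. The following is only a minimal sketch, assuming the file is valid Python 3 and has been read into one string. The function name `replace_strings` and its `placeholder` parameter are hypothetical. On Python 3.12 and later, f-strings are reported as separate `FSTRING_*` tokens rather than `STRING`, and this sketch does not handle them.

```python
import io
import tokenize


def replace_strings(source, placeholder="STRING"):
    """Return source with every STRING token replaced by placeholder."""
    # Split the text the same way tokenize's readline will see it,
    # then record the absolute offset at which each line starts.
    lines = io.StringIO(source).readlines()
    line_starts = [0]
    for line in lines:
        line_starts.append(line_starts[-1] + len(line))

    pieces = []
    pos = 0  # end of the last string token copied so far
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # Token positions are (1-based row, 0-based column).
        begin = line_starts[tok.start[0] - 1] + tok.start[1]
        end = line_starts[tok.end[0] - 1] + tok.end[1]
        pieces.append(source[pos:begin])  # text before the string, untouched
        pieces.append(placeholder)
        pos = end
    pieces.append(source[pos:])  # whatever follows the last string
    return "".join(pieces)
```

Calling `replace_strings(text, " STRING ")` would give the `print( STRING )` spacing shown above. Because the tokenizer tracks comments and triple-quoted strings itself, quotes inside a `#` comment are left alone and a multiline string is replaced as a single unit.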