I'm working with a Python source code corpus. I would like the strings to be replaced with STRING. Python strings are annoying because they allow so many delimiters. Here is what I've tried and the issues I've run into.

  • r'"(\\"|[^"])*"' and r"'(\\'|[^'])*'"

    This doesn't work when a string contains the opposite delimiter.

  • r'(\'|"|\'\'\'|""")(?:\\\1|(?!\1))*\1'

    This was my attempt at a catch-all, but the lookahead doesn't work. I basically wanted r'(\'|"|\'\'\'|""")(?:\\\1|[^\1])*\1' if that were possible.

  • Multiline strings mess stuff up. You can't use [^"""] because """ is not one character.

  • Strings that contain the other delimiters like "'".
  • Strings that escape the delimiter like '\''.
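One way the backreference idea above could be made to work is to pair the negative lookahead with a dot, so that each step consumes one character that does not start the closing delimiter. The following is only a sketch under stated assumptions: the names `DELIM`, `BODY`, `STRING_RE` and `replace_strings` are hypothetical, string prefixes such as `r` or `f` are ignored, and an apostrophe inside a comment would still be mistaken for the start of a string.

```python
import re

# Triple quotes are listed first so they win over the single-character delimiters.
DELIM = r"(\"\"\"|'''|\"|')"
# Either a backslash plus any character, or one character that does not
# begin the closing delimiter captured in group 1.
BODY = r"(?:\\.|(?!\1).)*"
# DOTALL lets the dot cross newlines, which triple-quoted strings need.
STRING_RE = re.compile(DELIM + BODY + r"\1", re.DOTALL)


def replace_strings(source, placeholder="STRING"):
    """Replace every match of STRING_RE in source with placeholder."""
    # A function replacement avoids backslash processing of the placeholder.
    return STRING_RE.sub(lambda match: placeholder, source)
```

Because `(?!\1)` consumes nothing on its own, the original attempt could never advance through the string body; adding the dot after it is what plays the role of the `[^\1]` that regex syntax does not allow.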

These are the kinds of strings that need to be matched. Each bullet below is one complete string literal, delimiters included.

  • '/$\'"`'
  • '\\'
  • '^__[\'\\"]([^\'\\"]*)[\'\\"]'
  • "Couldn't do that"

These are all valid strings, but you can probably see where it might be hard to match them. Essentially, I want this:

def hello_world():
    print("'blah' \"blah\"")

To become:

def hello_world():
    print( STRING )

For simplicity's sake, let's say the entire Python file is inside of a string. Right now I am reading a file line by line, but I could treat it as one string if necessary. It really doesn't matter how the file is read. If your solution requires a specific method, I will use that. I am not sure this problem can be solved entirely with regex. If you have a solution that involves other code, that would be much appreciated as well.

jackl
  • Why not process this at the AST level, rather than trying to regex the source? – jonrsharpe Feb 28 '20 at 20:37
  • I am also considering that approach, but I want to test this approach as well. – jackl Feb 28 '20 at 20:38
  • Why not join the four regexes for `"""`, `'''`, `"` and `'` with `|` between them? – Kelly Bundy Feb 28 '20 at 20:40
  • I've tried that, but I am having trouble using a lookahead. – jackl Feb 28 '20 at 20:41
  • `r'(\'|"|\'\'\'|""")(?:\\\1|(?!\1))*\1'` – jackl Feb 28 '20 at 20:45
  • I mean `'|'.join([r'"(\\"|[^"])*"', r"'(\\'|[^'])*'"])` (but with the triplers as well). – Kelly Bundy Feb 28 '20 at 20:48
  • Oh, I see what you mean now. I will try it. – jackl Feb 28 '20 at 20:48
  • You're also missing raw strings (`r'...'`), f-strings (`f'...'`), and probably some others. I suspect f-strings in particular are unparseable with a regex because they can contain arbitrary Python expressions, including other string literals. – Emily Feb 28 '20 at 21:05
  • @Mike Can you show an example of a problematic f-string? – Kelly Bundy Feb 28 '20 at 21:11
  • @HeapOverflow It looks like I was wrong; expressions in f-strings aren't allowed to contain the same quote character that was used to enclose the string, even escaped. That breaks every way I can think of to make arbitrarily deep f-strings. (I assume I was thinking of the C# equivalent, where `$"a string{"another string"}"` is allowed and requires no escaping.) – Emily Mar 02 '20 at 02:07

1 Answer

You can try a regex that matches quoted strings but allows escaping:

[rR]?(?:'([^\\']*(?:\\.[^\\']*)*)'|"([^\\"]*(?:\\.[^\\"]*)*)")
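As a rough illustration of how the pattern above might be applied to source text, here is a minimal sketch using `re.sub`. The names `ANSWER_RE` and `mask_strings` are hypothetical, and the replacement text is assumed to be a bare `STRING`; pass `" STRING "` as the placeholder if the padded form from the question is wanted.

```python
import re

# The pattern from the answer, copied unchanged.
ANSWER_RE = re.compile(
    r"""[rR]?(?:'([^\\']*(?:\\.[^\\']*)*)'|"([^\\"]*(?:\\.[^\\"]*)*)")"""
)


def mask_strings(source, placeholder="STRING"):
    """Replace each quoted string matched by ANSWER_RE with placeholder."""
    # A function replacement avoids backslash processing of the placeholder.
    return ANSWER_RE.sub(lambda match: placeholder, source)


source = 'def hello_world():\n    print("\'blah\' \\"blah\\"")\n'
print(mask_strings(source))
```

Note that this pattern has no branch for triple-quoted strings, so a literal such as `"""a"""` is matched as several adjacent short strings rather than as one.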

While this may capture the majority of strings, I am pretty sure there are still some exceptions.
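Since the question is also open to non-regex solutions, one alternative that sidesteps most of those exceptions is to let Python's own lexer find the string literals via the standard-library `tokenize` module. This is a sketch, not a definitive implementation: the function name `mask_string_tokens` is hypothetical, the input is assumed to be complete, valid Python, and on Python 3.12 and later f-strings are emitted as separate `FSTRING_*` tokens rather than `STRING`, so this version would leave them untouched there.

```python
import io
import tokenize


def mask_string_tokens(source, placeholder="STRING"):
    """Replace every STRING token in source with placeholder."""
    # Split lines the same way the tokenizer's readline will see them.
    lines = io.StringIO(source).readlines()
    # Token positions are (row, col) with 1-based rows and 0-based columns.
    spans = [
        (tok.start, tok.end)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.STRING
    ]
    # Work backwards so that earlier positions stay valid after each edit.
    for (start_row, start_col), (end_row, end_col) in reversed(spans):
        head = lines[start_row - 1][:start_col]
        tail = lines[end_row - 1][end_col:]
        # A multi-line literal collapses into a single line here.
        lines[start_row - 1:end_row] = [head + placeholder + tail]
    return "".join(lines)
```

Because the tokenizer already understands prefixes, triple quotes, escapes, and comments, an apostrophe inside a comment is not treated as a string, and the prefix of a raw string is replaced together with its literal.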

This is based on J. Friedl's unrolling the loop technique:

Unrolling the Loop (using double quotes)

"                              # the start delimiter
 ([^\\"]*                      # anything but the end of the string or the escape char
         (?:\\.                #     the escape char preceding an escaped char (any char)
               [^\\"]*         #     anything but the end of the string or the escape char
                      )*)      #     repeat
                             " # the end delimiter
wp78de