Suppose we have input like the following (it is only an example; whether it makes sense does not matter):

data = "(((column_1 + 7.45) * 3) <>    column_2 - ('string\'1' / 2))"

I need to match a string literal that starts and ends with ' and may contain escaped single quotes, as in the example above, using Python's re module. The result should be string\'1. How can this be achieved?

EDIT: I am using the PLY library, and the rule would be used like this:

def t_leftOperand_arithmetic_rightOperand_STRING(self, t):
    r'<regex>'
    t.lexer.pop_state()
    return t
kubisma1

1 Answer

I believe you have to account for the escape character itself being escaped as well (for example, a literal that ends in \\).

For that, you'd need '[^'\\]*(?:\\[\S\s][^'\\]*)*'
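As a minimal sketch of how this pattern could be applied with Python's re module: the names `STRING_RE`, `literal`, and `inner` are illustrative, and the input is written as a raw string on the assumption that the backslash is really present in the data (in a normal double-quoted Python literal, `\'` collapses to a plain quote).

```python
import re

# Unrolled-loop pattern from the answer: an opening quote, a run of characters
# that are neither a quote nor a backslash, then any number of
# (backslash + any character, followed by another plain run), then the closing quote.
STRING_RE = re.compile(r"'[^'\\]*(?:\\[\S\s][^'\\]*)*'")

# Raw string, so the backslash before the inner quote is kept in the data.
data = r"(((column_1 + 7.45) * 3) <>    column_2 - ('string\'1' / 2))"

match = STRING_RE.search(data)
literal = match.group(0)   # the whole literal, including the outer quotes
inner = literal[1:-1]      # the literal without its delimiters

print(literal)  # 'string\'1'
print(inner)    # string\'1

# An escaped backslash right before the closing quote does not hide that quote,
# so this input yields two separate literals.
for found in STRING_RE.findall(r"'a\\' + 'b'"):
    print(found)
```

In a PLY rule the same pattern would go into the function's docstring; because the pattern itself contains single quotes, the docstring would have to be delimited with double quotes, e.g. r"...".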


Input

'''Set 1 - this
is another
mul\'tiline
string'''
'''Set 2 - this
is' a\\nother
mul\'''tiline
st''ring'''

Benchmark:

Regex1:   '[^'\\]*(?:\\[\S\s][^'\\]*)*'
Options:  < none >
Completed iterations:   400  /  400     ( x 1000 )
Matches found per iteration:   9
Elapsed Time:    5.00 s,   4995.27 ms,   4995267 µs


Regex2:   '(?:[^'\\]|\\.)*'
Options:  < s >
Completed iterations:   400  /  400     ( x 1000 )
Matches found per iteration:   9
Elapsed Time:    7.00 s,   7000.68 ms,   7000680 µs

Additional regex (for testing only; as @ridgerunner notes, this form can cause a backtracking problem):

Regex3:   '(?:[^'\\]+|\\.)*'
Options:  < s >
Completed iterations:   400  /  400     ( x 1000 )
Matches found per iteration:   9
Elapsed Time:    5.45 s,   5449.72 ms,   5449716 µs
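
The figures above come from a separate benchmarking tool. A rough way to repeat the comparison with Python's own re engine might look like the sketch below; `SAMPLE`, `PATTERNS`, and `count_matches` are hypothetical names, the sample text is not the benchmark input shown earlier, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

# The three patterns from the benchmark. re.DOTALL plays the role of the
# "s" option so that "." also matches a newline.
PATTERNS = {
    "unrolled loop": re.compile(r"'[^'\\]*(?:\\[\S\s][^'\\]*)*'"),
    "alternation": re.compile(r"'(?:[^'\\]|\\.)*'", re.DOTALL),
    "alternation with +": re.compile(r"'(?:[^'\\]+|\\.)*'", re.DOTALL),
}

# Hypothetical sample: four well-formed literals per repetition.
SAMPLE = r"a = 'plain' + 'it\'s' + 'back\\slash' - 'two\nlines' " * 200


def count_matches(pattern):
    """Return how many string literals the pattern finds in SAMPLE."""
    return len(pattern.findall(SAMPLE))


for name, pattern in PATTERNS.items():
    seconds = timeit.timeit(lambda: count_matches(pattern), number=200)
    print(f"{name:20s} {count_matches(pattern)} matches, {seconds:.3f} s")
```

All three patterns should report the same number of matches on well-formed input; only the elapsed time is expected to differ.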
  • Or just `'(?:[^'\\]|\\.)*'` I think. – Zastai Feb 14 '16 at 20:02
  • @Zastai - I stay away from using that form. One reason is that it's much slower. The other is that it might run into a stack overflow if embedded in a heavy expression. I'll post a benchmark. –  Feb 14 '16 at 20:04
  • Fair enough - but unless there are specific performance considerations, I try to go for the most readable (although the 40% difference here is not small). Any specific reason for the classes instead of a dot, other than matching a newline (for which DOTALL could be set if required)? – Zastai Feb 14 '16 at 20:17
  • And since you're benching anyway, does `'(?:[^'\\]*|\\.)*'` help matters any? – Zastai Feb 14 '16 at 20:20
  • A couple of answers. First, you never want to leave something optional inside an optionally quantified group, i.e. like this: `'(?:[^'\\]*|\\.)*'`; something like `'(?:[^'\\]+|\\.)*'` is OK. Second, the more escaped items a string contains, the more the alternation method slows down, exponentially, compared to the unrolled-loop method. Third, the dot `.` matches tabs, form feeds, and other control characters, so why exclude a newline? (?s) is acceptable (I added it to the benchmark). I'll also post results for `'(?:[^'\\]+|\\.)*'`. –  Feb 14 '16 at 20:48
  • @Zastai - No. The expression `'(?:[^'\\]*|\\.)*'` is subject to catastrophic backtracking for cases where the string does not match. BTW, this question gets asked a lot - e.g. [PHP: Regex to ignore escaped quotes within quotes](http://stackoverflow.com/a/5696141/433790) – ridgerunner Feb 14 '16 at 20:53
  • @sin - `'(?:[^'\\]+|\\.)*'` suffers the same catastrophic backtracking problem when faced with a non-matching string. – ridgerunner Feb 14 '16 at 20:56
  • Yep, it does. That's why I always leave alternations unquantified. Still, this `'(?:[^'\\]|\\.)*'` is subject to stack overflow in some situations. –  Feb 14 '16 at 20:58
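
The backtracking concern raised in the comments can be observed with a small experiment. This is only a sketch: `time_failed_match` is a hypothetical helper, the unterminated input is made up, and the timings depend on the regex engine (here CPython's re module) and the machine. The lengths are kept small on purpose, since the time for the quantified alternation is expected to grow quickly with each extra character.

```python
import re
import time

UNROLLED = re.compile(r"'[^'\\]*(?:\\[\S\s][^'\\]*)*'")
PLUS_ALTERNATION = re.compile(r"'(?:[^'\\]+|\\.)*'", re.DOTALL)


def time_failed_match(pattern, length):
    """Return (match, seconds) for an unterminated literal of the given length."""
    text = "'" + "a" * length  # opening quote but no closing quote
    start = time.perf_counter()
    match = pattern.match(text)
    return match, time.perf_counter() - start


for length in (12, 14, 16, 18):
    _, slow = time_failed_match(PLUS_ALTERNATION, length)
    _, fast = time_failed_match(UNROLLED, length)
    print(f"length {length}: alternation with + {slow:.4f} s, unrolled {fast:.6f} s")
```

Both patterns reject the input; the difference lies in how much work the engine does before giving up.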