python regexp not match sequence

Question

I need to wrap some MathJax strings in an HTML tag. How can I exclude \) from the match so that the pattern does not match the whole string? With a single character it's easy, e.g. [^)], but what do I do when I need to exclude a sequence of two characters, such as \)?

search_str = "\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"
out = re.sub(r'(\\\([^\\\)]+\\\))', '<span>\1</span>', search_str)

Lazy matching (`.+?`) should be sufficient; a [tempered greedy token](http://stackoverflow.com/a/37343088/3836111) might be better. Neither will help with nested parentheses. - Sebastian Proske, Dec 08 '16 at 14:28

Wiktor Stribiżew · Accepted Answer · 2020-01-24T22:54:06.450

You are trying to match any text other than the 2-character sequence \) with [^\\\)]+. That is wrong, because [^...] is a negated character class: it matches a single character that is not in the set or range defined in the class. It can never match character combinations; the * and + quantifiers just repeat the single-character match.
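A small check of that point, using the string from the question plus one made-up sample (`sample` is an illustrative assumption, not from the question). The class rejects every backslash and every closing parenthesis individually, not just the pair \).

```python
import re

search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"

# [^\\\)] accepts exactly one character that is neither "\" nor ")".
# Right after each opening \( comes the backslash of \ce, so the class
# cannot consume even one character and no match is found.
broken = re.findall(r'\\\([^\\\)]+\\\)', search_str)
print(broken)   # []

# Illustrative string (an assumption, not from the question): only the
# span without any inner backslash can match.
sample = r"\(x+y\) and \(a\,b\)"
partial = re.findall(r'\\\([^\\\)]+\\\)', sample)
print(partial)  # ['\\(x+y\\)']
```

The second span is skipped because of the inner \, even though it contains no \) before the closing delimiter.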

What you have in mind is called a tempered greedy token: (?:(?!\\\)).)* or, in its lazy form, (?:(?!\\\)).)*?.
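As a rough sketch, here is the tempered greedy token applied to the string from the question. The variable names are only for illustration.

```python
import re

search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"

# (?!\\\))  fail if the next two characters are \)
# .         otherwise consume one character
# The group repeats, so it eats everything up to (not including) the first \)
tempered = re.compile(r'\\\((?:(?!\\\)).)*\\\)')
matches = tempered.findall(search_str)
print(matches)
```

Each formula is returned as a separate match, because the lookahead stops the repetition right before the first \) it meets.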

However, the tempered greedy token is not the best practice in this case. See the rexegg.com note on when not to use TGT:

For the task at hand, this technique presents no advantage over the lazy dot-star .*?{END}. Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is {END}.

The comparative performance of these two versions will depend on your engine's internal optimizations. The pcretest utility indicates that PCRE requires far fewer steps for the lazy-dot-star version. On my laptop, when running both expressions a million times against the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the lazy version and 800 milliseconds for the tempered version.

Therefore, if the string that tempers the dot is a delimiter that we intend to match eventually (as with {END} in our example), this technique adds nothing to the lazy dot-star, which is better optimized for this job.
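The figures quoted above are for PCRE. Python's re module is a different engine, so the ratio may well differ. A minimal way to measure it yourself, assuming the same {START} Mary {END} test string and an arbitrarily chosen run count:

```python
import re
import timeit

text = "{START} Mary {END}"
lazy = re.compile(r'\{START\}.*?\{END\}')
tempered = re.compile(r'\{START\}(?:(?!\{END\}).)*\{END\}')

# Both patterns should find the same span; only the cost differs.
lazy_found = lazy.findall(text)
tempered_found = tempered.findall(text)

runs = 10_000  # arbitrary; raise it for more stable numbers
lazy_time = timeit.timeit(lambda: lazy.search(text), number=runs)
tempered_time = timeit.timeit(lambda: tempered.search(text), number=runs)
print(f"lazy: {lazy_time:.4f}s  tempered: {tempered_time:.4f}s")
```

The timings depend on the machine and the Python version, so treat them as a comparison on your own setup rather than a general result.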

Your strings seem to be well-formed and rather short, so a plain lazy dot pattern is enough: \\\(.*?\\\).

Besides, you need to use the r prefix (a raw string literal) for the replacement pattern; otherwise \1 is parsed as an octal escape for the character \x01 (start of heading) rather than as a backreference.

import re
search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"
print(search_str)
out = re.sub(r'(\\\(.*?\\\))', r'<span>\1</span>', search_str)
print(out)

See the Python demo
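To see the effect of the r prefix on the replacement in isolation, here is a short check on a made-up input (`text` is an illustrative assumption):

```python
import re

# In a normal string literal, "\1" is an escape for the character chr(1).
assert '\1' == '\x01'
# In a raw literal the backslash survives, so re.sub sees a backreference.
assert r'\1' == '\\1'

text = r"\(x\)"
wrong = re.sub(r'(\\\(.*?\\\))', '<span>\1</span>', text)
right = re.sub(r'(\\\(.*?\\\))', r'<span>\1</span>', text)
print(repr(wrong))  # '<span>\x01</span>'
print(repr(right))  # '<span>\\(x\\)</span>'
```

Without the prefix the matched formula is silently replaced by a control character, which is easy to miss because it is invisible when printed.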

score 0 · Answer 2 · answered Dec 08 '16 at 14:26

I think that [^\\][^)] should do the trick, or nearly so. That will match any two characters as long as the first isn't a backslash and the second isn't a closing paren. You could experiment with some grouping, too, if that's not exactly what you want.
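A quick way to see what this pattern actually consumes, on a made-up input (`sample` is an assumption for illustration only):

```python
import re

# Illustrative input, not from the question.
sample = r"ab\)cd"

# The pattern always takes characters two at a time: one non-backslash
# followed by one non-")".
pairs = re.findall(r'[^\\][^)]', sample)
print(pairs)  # ['ab', ')c']
```

The second pair starts at the ) of the \) delimiter, so the pattern would probably need the grouping experiments mentioned above before it could stand in for "anything except \)".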

score 0 · Answer 3 · answered Dec 09 '16 at 07:54

Thanks to Sebastian's recommendation, I used a tempered greedy token:

(\\\((?:(?!\\\)).)*\\\))

simply awesome :-)
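A sketch of this pattern in a full substitution, with the capturing group closed by a final ) so that it compiles. The multi-line input and the re.DOTALL flag are assumptions added here: by default . does not match a newline, so a formula spanning several lines would otherwise be skipped.

```python
import re

# Tempered greedy token inside one capturing group.
# re.DOTALL (an addition, not in the answer) lets "." cross line breaks.
pattern = re.compile(r'(\\\((?:(?!\\\)).)*\\\))', re.DOTALL)

# Illustrative input: the first formula contains a newline.
text = "\\(a +\n b\\) text \\(c\\)"
out = pattern.sub(r'<span>\1</span>', text)
print(out)
```

Both formulas end up wrapped in their own span; dropping re.DOTALL would leave the first one untouched.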

d3im