You are trying to match any text other than the 2-character sequence \) with [^\\\)]+, which is wrong: [^...] is a negated character class, and it matches a single character that is not in the set or ranges defined inside the class. It can never match multi-character combinations; the * and + quantifiers just repeat that single-character match.
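A quick check makes this visible. In the sketch below the sample string is made up for illustration: it contains a lone ) and a lone \ before the real \) delimiter, and the negated class stops at each of them.

```python
import re

# Hypothetical sample: a lone ")" and a lone "\" occur before the "\)" delimiter.
sample = r"a)b\c\)"

# [^\\\)]+ matches runs of characters that are neither "\" nor ")",
# so every single backslash or parenthesis ends a run.
runs = re.findall(r'[^\\\)]+', sample)
print(runs)
```

The text is split at every backslash and every closing parenthesis, not only where the two appear together.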
What you have in mind is called a tempered greedy token: (?:(?!\\\)).)* or, in its lazy form, (?:(?!\\\)).)*?.
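As a minimal sketch of how the tempered greedy token behaves (the sample string is again an assumption, not taken from the question), the lookahead is checked before every character, so matching stops only in front of the full 2-character sequence:

```python
import re

# Hypothetical sample with a lone ")" and a lone "\" before the "\)" delimiter.
sample = r"a)b\c\) tail"

# Consume one character at a time, but only while the next
# two characters are not a backslash followed by ")".
tgt = re.compile(r'(?:(?!\\\)).)*')
m = tgt.match(sample)
print(m.group())
```

Unlike the negated character class, this pattern runs past the lone ) and the lone \ and halts right before \).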
However, the tempered greedy token is not the best practice in this case. See the rexegg.com note on when not to use TGT:
For the task at hand, this technique presents no advantage over the lazy dot-star .*?{END}. Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is {END}.
The comparative performance of these two versions will depend on your engine's internal optimizations. The pcretest utility indicates that PCRE requires far fewer steps for the lazy-dot-star version. On my laptop, when running both expressions a million times against the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the lazy version and 800 milliseconds for the tempered version.
Therefore, if the string that tempers the dot is a delimiter that we intend to match eventually (as with {END} in our example), this technique adds nothing to the lazy dot-star, which is better optimized for this job.
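To see that the two approaches find the same text when the delimiter is matched anyway, here is a small sketch using Python's re module and the sample string from this answer. It only compares the results; it makes no claim about relative speed, which the quote above measured for PCRE, not for Python.

```python
import re

text = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"

# Lazy dot: expand one character at a time until "\)" can match.
lazy = re.compile(r'\\\(.*?\\\)')
# Tempered greedy token: same delimiters, the dot is guarded by a lookahead.
tempered = re.compile(r'\\\((?:(?!\\\)).)*\\\)')

lazy_matches = lazy.findall(text)
tempered_matches = tempered.findall(text)
print(lazy_matches == tempered_matches)
```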
Your strings seem to be well-formed and rather short, so a plain lazy dot pattern, \\\(.*?\\\), is enough.
Besides, you need to use the r prefix (a raw string literal) in the replacement pattern definition; otherwise \1 is parsed as an octal escape and becomes the control character \x01 (start of heading) instead of a backreference to group 1.
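The difference is easy to check in isolation. This sketch uses a throwaway pattern (x) and input 'x', both chosen only for illustration:

```python
import re

plain = '<span>\1</span>'   # "\1" is an octal escape here: the character \x01
raw = r'<span>\1</span>'    # the backslash and "1" are kept for re.sub to interpret

print(repr(plain))
print(repr(raw))

# With the plain string, the control character is inserted literally;
# with the raw string, \1 is replaced by the text captured in group 1.
wrapped_plain = re.sub(r'(x)', plain, 'x')
wrapped_raw = re.sub(r'(x)', raw, 'x')
print(repr(wrapped_plain))
print(repr(wrapped_raw))
```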
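import re

# Raw string, so the backslashes in the LaTeX-style markup stay literal
search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"
print(search_str)
# \\\( and \\\) match the literal \( and \) delimiters; .*? lazily matches the text between them
out = re.sub(r'(\\\(.*?\\\))', r'<span>\1</span>', search_str)
print(out)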