Could you please help me define a regex that would:
- match the word
r'(\d+_\d\d\d(?:_back)?)'
- "word" means that it shouldn't be preceded or followed by anything except for the proper punctuation signs or beginning/end of string/line
- work in multiline strings, anywhere in the strings, and in strings consisting only of this pattern and nothing else
- not match in
%96_175
and 44_5555
(because neither the % nor the 4th "5" is a punctuation character).
Examples: Pass (12_345, 012_345, or 012_345_back is the found group):
['12_345',
'bla-bla 012_345',
'bla-bla 12_345 bla-bla',
'34\n012_345',
'012_345\n34',
'text—012_345—text',
'text--12_345, text',
'text. 012_345_back.']
Fail (no match here):
[
'text12_345',
'12_345text',
'12_3456',
'%12_345',
'!12_345',
'.12-345',
'12_345_front'
]
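For reference, the two lists above can be run against the simple pattern with a small Python check. This is only a sketch: the names SIMPLE, PASS, FAIL and misclassified are illustrative, and the em dash is written as \u2014.

```python
import re

# The simple pattern from the top of the question, with no boundary checks.
SIMPLE = re.compile(r'(\d+_\d\d\d(?:_back)?)')

PASS = [
    '12_345',
    'bla-bla 012_345',
    'bla-bla 12_345 bla-bla',
    '34\n012_345',
    '012_345\n34',
    'text\u2014012_345\u2014text',  # \u2014 is the em dash
    'text--12_345, text',
    'text. 012_345_back.',
]

FAIL = [
    'text12_345',
    '12_345text',
    '12_3456',
    '%12_345',
    '!12_345',
    '.12-345',
    '12_345_front',
]


def misclassified(pattern):
    """Return (missed, false_positives) for a compiled pattern."""
    missed = [s for s in PASS if not pattern.search(s)]
    false_positives = [s for s in FAIL if pattern.search(s)]
    return missed, false_positives


missed, false_positives = misclassified(SIMPLE)
print('missed:', missed)
print('false positives:', false_positives)
```

The simple pattern finds every Pass string, but it also matches inside every Fail string except '.12-345', which has a hyphen instead of an underscore.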
What I am trying to distinguish is a proper identifier of the form \d+_\d\d\d(?:_back)?, inserted by a user in a comment on my website, from the same string being part of another string. The simple regex worked until someone inserted a link to a Wikipedia article ending with "№_175", which was URL-encoded to %E2%84%96_175, with "96_175" matching my pattern.
I got stuck trying to match the "proper punctuation signs" or the beginning or end of a string or line. By then the regex was already so complex (I was listing all the reasonable Unicode punctuation characters I could think of) that I suspected I was doing something wrong. I also have difficulty excluding extra digits while still allowing a possible end of line or string.
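One possible direction, sketched here under assumptions rather than offered as a definitive answer, is a pair of negative lookarounds that reject any neighbouring character outside an explicit allowlist. Start and end of string then pass automatically, and an extra digit on either side is rejected because digits are not in the allowlist. The BEFORE and AFTER sets and the find_id helper are hypothetical; which characters count as "proper punctuation" is a guess and would need adjusting.

```python
import re

# Assumption: illustrative allowlists of characters that may touch the identifier.
# '%' and '!' are left out of BEFORE so that '%12_345' and '!12_345' fail.
BEFORE = r'\s,.;:(\-\u2013\u2014'    # whitespace, basic punctuation, en/em dash
AFTER = r'\s,.;:)!?\-\u2013\u2014'

# (?<![^...]) means "not preceded by a character outside the set";
# it also succeeds at the very start of the string. The lookahead
# (?![^...]) does the same for the following character and end of string.
PATTERN = re.compile(
    rf'(?<![^{BEFORE}])(\d+_\d{{3}}(?:_back)?)(?![^{AFTER}])'
)


def find_id(text):
    """Return the first identifier found in text, or None."""
    m = PATTERN.search(text)
    return m.group(1) if m else None


print(find_id('text. 012_345_back.'))
print(find_id('34\n012_345'))
print(find_id('%E2%84%96_175'))
print(find_id('44_5555'))
```

Because newlines are covered by \s, no re.MULTILINE flag or ^/$ anchors are needed for multiline strings in this sketch.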