0

Could you please help me define a regex that would:

  • match the word r'(\d+_\d\d\d(?:_back)?)'
  • "word" means that it shouldn't be preceded or followed by anything except for the proper punctuation signs or beginning/end of string/line
  • work in multiline strings, anywhere in the strings, and in strings consisting only of this pattern and nothing else
  • not match in %96_175 and 44_5555 (because neither the % nor the fourth "5" is a punctuation character).

Examples: Pass (12_345, 012_345, or 012_345_back is the found group):

['12_345',
 'bla-bla 012_345',
 'bla-bla 12_345 bla-bla',
 '34\n012_345',
 '012_345\n34',
 'text—012_345—text',
 'text--12_345, text',
 'text. 012_345_back.']

Fail (no match here):

[
 'text12_345',
 '12_345text',
 '12_3456',
 '%12_345',
 '!12_345',
 '.12-345',
 '12_345_front'
]

What I am trying to distinguish is a proper identifier of the form \d+_\d\d\d(?:_back)?, inserted by a user in a comment on my website, from the same string being part of another string. The simple regex worked until someone inserted a link to a Wikipedia article ending with "№_175", which was URL-encoded to %E2%84%96_175, with "96_175" matching my pattern.

I got stuck trying to match the "proper punctuation signs" or the beginning or end of a string or line. By then the regex was already so complex (I was listing all the reasonable Unicode punctuation characters I could think of) that I thought I was doing something wrong. I also have difficulty excluding extra digits while still allowing a possible end of line or string.

texnic

2 Answers

1

Depending on how you need to handle (or ignore) symbols that are neither letters nor proper punctuation, you can either rely on the Python re word boundary \b (as suggested in another answer) or enumerate the 'proper' punctuation marks in non-capturing groups before and after the pattern.

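Python's built-in re module has no \p punctuation class. With an engine that supports Unicode properties (for example the third-party regex module) you could use the punctuation class \p{P}:

(?:^|\s|\p{P})(\d+_\d\d\d)(_back)?(?:$|\s|\p{P})

With the built-in re, build the character class from string.punctuation instead, along the lines of https://stackoverflow.com/a/37708340/5874981. Note that both \p{P} and string.punctuation include % and !, which the question wants rejected.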

For a start, assuming that only the full stop, comma and hyphen count as sufficiently 'proper', try

(?:^|\s|\.|,|-)(\d+_\d\d\d)(_back)?(?:$|\s|\.|,|-)
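One caveat with the pattern above: its non-capturing groups consume the delimiter, so two identifiers separated by a single space cannot both be found in one scan. A minimal sketch of a lookaround variant, assuming Python 3 and an illustrative delimiter set (whitespace, full stop, comma, hyphen, em dash). The names DELIMS, ID_RE and find_ids are made up for this example:

```python
import re

# Assumed delimiter set (illustrative, extend as needed): whitespace,
# full stop, comma, hyphen, em dash. Written as character-class content.
DELIMS = r"\s.,\-\u2014"

# "Not preceded / not followed by a non-delimiter" is also true at the
# start and end of the string, so no ^ or $ alternatives are needed.
# The lookarounds do not consume the delimiter characters.
ID_RE = re.compile(rf"(?<![^{DELIMS}])(\d+_\d\d\d(?:_back)?)(?![^{DELIMS}])")


def find_ids(text):
    """Return every identifier in text that stands alone between delimiters."""
    return ID_RE.findall(text)


print(find_ids("text. 012_345_back."))     # ['012_345_back']
print(find_ids("bla 12_345 67_890_back"))  # ['12_345', '67_890_back']
print(find_ids("%E2%84%96_175"))           # []
```

Because the delimiter list is a negated character class, adding another allowed character means appending it to DELIMS only.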
Serge
  • I am fine defining the list of punctuation characters myself. However, this solution for some reason fails on a few of the no-match examples: https://repl.it/Hmfk/0 – texnic May 08 '17 at 16:58
  • remember also to escape punctuation like \. – Serge May 08 '17 at 17:40
  • Punctuation is tricky, as many punctuation signs have a special meaning in re. For instance, '.' (dot) in the default mode matches any character except a newline. Brackets denote a group, and the question mark has a bunch of meanings, depending on the surrounding characters. Luckily, a backslash can be used to match exactly any overloaded punctuation character you need – Serge May 08 '17 at 17:51
  • That was it! I put there . instead of \. So now it works: https://repl.it/Hmfk/1. Thanks! – texnic May 08 '17 at 21:54
  • Why did you put backslash for comma, hyphen? They are not special, are they? Please answer this, then I'll edit your answer to include my pattern (or take it yourself from https://repl.it/Hmfk/3), and then I will accept your answer. BTW, you don't need \n if you have \s. – texnic May 08 '17 at 21:58
  • Correct, you can remove the extraneous backslashes. While in the above code comma and hyphen have no special meaning, in some expressions they might. For example, if you prefer the more compact set notation to the alternation, the hyphen can become a special character that indicates a range. To work around the challenge of memorizing every special character's meaning and context, I often escape punctuation "just in case". Only paranoiacs survive in the merciless world of regex :) – Serge May 09 '17 at 02:54
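The escaping issue discussed in these comments can be reproduced with a short sketch (Python 3 assumed; the names ID, unescaped, escaped, proper and from_list are chosen for illustration only). It shows the unescaped dot accepting % as a "delimiter", and one way to let re.escape handle the escaping when the delimiters come from a plain string:

```python
import re

ID = r"(\d+_\d\d\d)(_back)?"

# Unescaped dot: "." matches any character, so "%" passes as a delimiter.
unescaped = re.compile(r"(?:^|\s|.|,|-)" + ID + r"(?:$|\s|.|,|-)")
# Escaped dot: only a literal full stop is accepted.
escaped = re.compile(r"(?:^|\s|\.|,|-)" + ID + r"(?:$|\s|\.|,|-)")

print(bool(unescaped.search("%12_345")))  # True
print(bool(escaped.search("%12_345")))    # False

# Assumed list of 'proper' punctuation; re.escape adds whatever
# backslashes are needed, so nothing has to be memorized.
proper = ".,-"
delim = "[" + re.escape(proper) + "]"
from_list = re.compile(r"(?:^|\s|" + delim + ")" + ID + r"(?:$|\s|" + delim + ")")

print(from_list.search("text--12_345, text").group(1))  # 12_345
```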
0

I'm not sure if I'm misunderstanding the question, but if the only problem you're having is matching a whole word and ignoring any characters other than the ones you want, I'd suggest trying a regex word boundary.

So your regular expression would be \b\d+_\d\d\d(?:_back)?\b

Give it a try and tell me if that's what you need.
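To check this suggestion against the question's examples, here is a small sketch (Python 3 assumed; WORD_RE and samples are names chosen for illustration). \b only separates word characters (letters, digits, underscore) from everything else, so it treats % and ! as boundaries just like a space:

```python
import re

WORD_RE = re.compile(r"\b\d+_\d\d\d(?:_back)?\b")

samples = ["12_345", "text12_345", "12_345_front",
           "%12_345", "!12_345", "%E2%84%96_175"]

for s in samples:
    m = WORD_RE.search(s)
    # Print the matched text, or None when the pattern does not match.
    print(repr(s), "->", m.group(0) if m else None)
```

The first three samples behave as the question requires, but the last three all produce a match, including "96_175" inside the URL-encoded string.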

Santiago Alzate
  • Will it match words followed by 'unprintable symbols' which are neither letters, punctuation nor underscores? If @texnic does not care about these, this could be the way to go – Serge May 08 '17 at 13:57
  • It matches `%12_345`, but it shouldn't. What I am trying to distinguish is a proper identifier of the form \d+_\d\d\d(?:_back)?, inserted by a user in a comment on my site, from the same string being part of another string. The simple regex worked until someone inserted a link to a Wikipedia article ending with "№_175", which was URL-encoded to `%E2%84%96_175`, with "96_175" matching my pattern. – texnic May 08 '17 at 16:21