According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this restriction is common among regex engines, not all of them treat it as an error, hence the difference you see between the re and Onigmo regex libraries.
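To see the Python side of that difference, here is a minimal sketch that compiles a negative lookbehind containing a capturing group. The pattern and the sample string are made up for illustration, and only Python's re is exercised; the Onigmo behavior is not reproduced here.

```python
import re

# Hypothetical pattern: "am" not preceded by a digit, with the digit
# wrapped in a capturing group inside the negative lookbehind.
pattern = re.compile(r'(?<!(\d))am')

match = pattern.search('x am')

# A negative lookbehind only succeeds when its body does NOT match,
# so group 1 never participates in a successful match.
matched_text = match.group(0)
group_one = match.group(1)
print(repr(matched_text), repr(group_one))
```

Python's re compiles this without complaint, which is why the same pattern can behave differently once it is moved to an Onigmo-based engine.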
Now, as for your regex, it does not work correctly in either Ruby or Python: the \b inside a character class in a Python and Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when you use a word boundary after an optional non-word char, then if that char appears in the string, a word char must appear immediately to the right of it. The word boundary must be moved to right after m, before \.?.
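Both points can be checked with a short sketch in Python. The sample strings are invented for illustration, and the patterns are simplified to the letter a only.

```python
import re

# Inside a character class, \b is a backspace (\x08), not a word boundary.
backspace_hits = re.findall(r'[\b]', 'a\x08m')

sample = 'at 10 a.m. sharp'

# Boundary placed AFTER the optional trailing dot: the dot is followed by a
# space, so \b cannot match there; the engine backtracks and drops the dot.
dot_then_boundary = re.search(r'\ba\.?m\.?\b', sample).group(0)

# Boundary placed right after "m", BEFORE the optional dot: the dot is kept.
boundary_then_dot = re.search(r'\ba\.?m\b\.?', sample).group(0)

print(backspace_hits, dot_then_boundary, boundary_then_dot)
```

The second search returns the marker without its final dot, the third returns it whole, which is the reason for placing \b before \.? in the fixed pattern.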
Another flaw with the current approach is that lookbehinds are not the best tool for excluding certain contexts like these. E.g. you can't account for a variable amount of whitespace between the time digits and am / pm. It is better to match and capture the contexts you do not want to touch, and just match those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
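A rough illustration of the whitespace limitation, assuming Python's built-in re module: a lookbehind containing \s* is rejected at compile time, while an alternation with a capturing group handles the same case. Both patterns below are simplified, hypothetical stand-ins, not the final regex.

```python
import re

# A lookbehind with a variable-width body (\s*) is rejected by Python's re.
try:
    re.compile(r'(?<!\d\s*)[ap]m')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# Alternation instead: group 1 captures the context to keep (a digit, any
# amount of whitespace, then am/pm); the second branch matches am/pm elsewhere.
simplified = r'(\d\s*[ap]m\b)|\b[ap]m\b'
replaced = re.sub(simplified, lambda m: m.group(1) or '<X>', 'am at 10:43   am')

print(lookbehind_error)
print(replaced)
```

Only the standalone am at the start is replaced; the one that follows the time digits survives regardless of how many spaces separate them.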
Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
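As a sketch of that kind of merging, the spelled-out alternation below is a hypothetical example (it is not the pattern from the question) that collapses into one short expression built from a character class and optional dots.

```python
import re

# Hypothetical spelled-out alternation covering every am/pm spelling.
spelled_out = re.compile(r'am|am\.|a\.m|a\.m\.|pm|pm\.|p\.m|p\.m\.')

# The same set of strings, using a character class and ? quantifiers.
merged = re.compile(r'[ap]\.?m\.?')

samples = ['am', 'am.', 'a.m', 'a.m.', 'pm', 'pm.', 'p.m', 'p.m.',
           'a..m', 'xm', 'a.m..']

# Both patterns should accept and reject exactly the same samples.
agreement = all(
    bool(spelled_out.fullmatch(s)) == bool(merged.fullmatch(s))
    for s in samples
)
print(agreement)
```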
Regex demo
\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?):
  \b - word boundary
  ((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
    (?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit, or 2 and then a digit from 0 to 3
    :[0-5][0-9] - : and then a number from 00 to 59
    \s* - 0+ whitespaces
    [pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot
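The same breakdown can be written directly into the pattern with Python's re.X (verbose) flag. This is only a sketch of an equivalent layout; the names rx_verbose, rx_compact and keep_group_one are introduced here for illustration.

```python
import re

rx_verbose = re.compile(r'''
    \b
    (                            # group 1: a time string to keep as is
        (?:[01]?[0-9]|2[0-3])    # hour: 0-9, 00-19, or 20-23
        :[0-5][0-9]              # colon, then minutes 00-59
        \s*                      # zero or more whitespace chars
        [pa]\.?m\b\.?            # a/p, optional dot, m, boundary, optional dot
    )
    |                            # or
    \b[ap]\.?m\b\.?              # a standalone am/pm marker
''', re.I | re.X)

# The one-line form, for comparison.
rx_compact = re.compile(
    r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?',
    re.I,
)

def keep_group_one(m):
    # Keep time strings (group 1); replace everything else with a newline.
    return m.group(1) or "\n"

text = 'am pm P.M. 10:56pm 10:43 a.m.'
verbose_result = rx_verbose.sub(keep_group_one, text)
compact_result = rx_compact.sub(keep_group_one, text)
print(verbose_result == compact_result)
```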
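Python fixed solution:
import re
text = 'am pm P.M. 10:56pm 10:43 a.m.'
# Group 1 captures am/pm inside a time string; the second alternative matches am/pm anywhere else
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
# Keep the match when Group 1 participated, otherwise replace it with a newline
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)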
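If the substitution is needed in more than one place, it can be wrapped in a small helper. The function name replace_standalone_meridiem and its repl parameter are hypothetical, and the extra inputs are made up to exercise variable whitespace and words that merely contain "am".

```python
import re

RX = re.compile(
    r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?',
    re.I,
)

def replace_standalone_meridiem(text, repl="\n"):
    """Replace am/pm markers that are not part of a time string with repl."""
    # Group 1 is set only for time strings, which are put back unchanged.
    return RX.sub(lambda m: m.group(1) or repl, text)

original_sample = replace_standalone_meridiem('am pm P.M. 10:56pm 10:43 a.m.')
spaced_time = replace_standalone_meridiem('Meet at 9:05   PM or pm', repl='<X>')
inside_word = replace_standalone_meridiem('spam')
print(repr(original_sample), repr(spaced_time), repr(inside_word))
```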
Ruby solution:
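text = 'am pm P.M. 10:56pm 10:43 a.m.'
# Same pattern as in Python; the i flag makes it case insensitive
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
# $1 is nil when the second alternative matched, so the newline is used instead
result = text.gsub(rx) { $1 || "\n" }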
Output:
"\n \n \n 10:56pm 10:43 a.m."