
Why does this regex work in Python but not in Ruby:

/(?<!([0-1\b][0-9]|[2][0-3]))/

It would be great to hear an explanation, and also how to work around it in Ruby.

EDIT: here is the whole line of code:

re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)

Basically, I'm trying to add '\n' when there is a colon and it is not a time.

mrzasa
echan00
  • What do you want to match with `\b`? What is your pattern supposed to match? – Wiktor Stribiżew Jul 18 '19 at 21:39
  • What is this supposed to match or not match on? Give us use cases, too, to test. – tadman Jul 18 '19 at 21:41
  • Btw, if a negative lookbehind assertion matches, it will never be able to capture anything in its content constructs. –  Jul 18 '19 at 21:52
  • @tadman added more info – echan00 Jul 18 '19 at 21:54
  • @WiktorStribiżew updated the post – echan00 Jul 18 '19 at 21:55
  • @sln Error is "Invalid pattern in look-behind" – echan00 Jul 18 '19 at 22:00
  • https://www.regexplanet.com/share/index.html?share=yyyyuwadc6r The fix https://www.regexplanet.com/share/index.html?share=yyyydutj3ar –  Jul 18 '19 at 22:03
  • So, just change the beginning to `(?<![0-1\b][0-9])(?<![2][0-3])` and it works for all engines. –  Jul 18 '19 at 22:07
  • The point is that `\b` when used in a character class is no longer a word boundary. In Python, it is a backspace char. In Ruby, the problem seems to be related to the capturing group inside the negative lookbehind. `(?<!([0-1\b][0-9]|[2][0-3]))` must be turned into `(?<![0-1][0-9]|[2][0-3])(?<!\b\d)` - if all you want is to match either `[0-1]` or word boundary with `[0-1\b]` – Wiktor Stribiżew Jul 18 '19 at 22:07
  • Does [my answer](https://stackoverflow.com/a/57107659/3832970) work for you in both Ruby and Python? – Wiktor Stribiżew Aug 06 '19 at 09:50

3 Answers


Ruby's regex engine doesn't allow capturing groups in negative lookbehinds. If you need grouping, you can use a non-capturing group (?:) instead:

[8] pry(main)> /(?<!([0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!([0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/

Docs:

 (?<!subexp)        negative look-behind

                     Subexp of look-behind must be fixed-width.
                     But top-level alternatives can be of various lengths.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, capturing group isn't allowed,
                     but non-capturing group (?:) is allowed.

Learned from this answer.

mrzasa
  • I saw the capture groups causing error in lookbehinds, but when I change it to non-capture groups it times out. –  Jul 18 '19 at 22:19
  • Should `(?<=aaa(?:b|cd))` be `(?<=aaa(b|cd))` that is not allowed ? –  Jul 18 '19 at 22:20
  • What is the string it times out on? Maybe there is some excessive backtracking – mrzasa Jul 18 '19 at 22:21
  • It times out on `(?<!(?:[0-1][0-9]|[2][0-3]))` not a lot to backtrack on with this. link https://www.regexplanet.com/share/index.html?share=yyyyda5tvar –  Jul 18 '19 at 22:23
  • What is the string that you're matching with this regex? – mrzasa Jul 18 '19 at 22:26
  • It works fast on my computer: `[10] pry(main)> /(?<!(?:[0-1][0-9]|[2][0-3]))/ =~ ':: ;:;' => 0` – mrzasa Jul 18 '19 at 22:30
  • I guess from that answer, the non-capture group in this `(?<=aaa(?:b|cd))` is not the _top level_ but it should be for `(?<=(?:a|bc))` right ? But, if only that's allowed, then you don't need `(?: )` ever at all. –  Jul 18 '19 at 22:32
  • That's why I don't trust online ruby testers. –  Jul 18 '19 at 22:32
  • I'm gonna have to rate Ruby as having a bizarro world regex engine. haha –  Jul 18 '19 at 22:36

According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this is common among regex engines, not all of them treat it as an error, hence the difference you see between the re (Python) and Onigmo (Ruby) regex libraries.
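To illustrate the Python side of that difference, here is a minimal sketch. It assumes only the standard `re` module; the sample strings are made up for illustration. It shows that `re` compiles the lookbehind from the question despite the capturing group, because every alternative is exactly two characters wide, and that what `re` rejects is a lookbehind whose alternatives differ in width.

```python
import re

# The lookbehind from the question: both alternatives are exactly two
# characters wide, so Python's re compiles it despite the capturing group.
lookbehind = re.compile(r'(?<!([0-1\b][0-9]|[2][0-3])):')

# A colon preceded by "10" is skipped; a colon preceded by "te" is matched.
skipped = lookbehind.search('10:30')
matched = lookbehind.search('Note: hi')

# Python's own restriction is on width, not on capturing groups:
# alternatives of different widths raise re.error at compile time.
try:
    re.compile(r'(?<=a|bc)x')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
```

So the same pattern can be legal in one engine and a compile-time error in the other without either engine being wrong about what it matches.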

Now, as for your regex, it does not work correctly in either Ruby or Python: the \b inside a character class in a Python or Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when you use a word boundary after an optional non-word char (here, the final dot), then whenever that char appears in the string, a word char must appear immediately to its right for the boundary to match. The word boundary must therefore be moved to right after m, before \.?.
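Both points can be checked quickly in Python. This is a small sketch with made-up sample strings, assuming nothing beyond the standard `re` module:

```python
import re

# Inside a character class, \b is the backspace character (\x08).
in_class = re.compile(r'[\b]')
has_backspace = in_class.search('a\x08b') is not None
at_boundary = in_class.search('a b') is not None

# With the boundary after the final dot, "p.m." followed by a space
# cannot match: both the dot and the space are non-word chars.
late_boundary = re.search(r'p\.m\.\b', '5 p.m. today')

# With the boundary right after "m", the optional dot is still consumed.
moved_boundary = re.search(r'p\.m\b\.?', '5 p.m. today')
```

Here `has_backspace` is true and `at_boundary` is false, `late_boundary` is `None`, and `moved_boundary.group()` is `'p.m.'`.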

Another flaw with the current approach is that lookbehinds are not the best tool for excluding certain contexts like these. E.g. you can't account for a variable amount of whitespace between the time digits and am / pm. It is better to match and capture the contexts you do not want to touch, and to simply match those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
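As a rough sketch of how the same idea could be applied to the colon case from the question (capture whole time strings and put them back unchanged, and replace every other colon with ':\n'), something like the following might work. The names `TIME`, `rx` and `break_after_colons` are hypothetical, and the time pattern is an assumption about what counts as a time:

```python
import re

# Assumed shape of a time: H:MM or HH:MM, optionally followed by am/pm.
TIME = r'\b(?:[01]?[0-9]|2[0-3]):[0-5][0-9](?:\s*[ap]\.?m\b\.?)?'

# Alternative 1 captures a whole time string (left untouched);
# alternative 2 matches any other colon (followed by a newline).
rx = re.compile(rf'({TIME})|:', re.I)

def break_after_colons(s):
    """Insert a newline after every colon that is not part of a time."""
    return rx.sub(lambda m: m.group(1) or ':\n', s)

example = break_after_colons('Note: meet at 10:30 pm')
```

Because the time alternative is tried first at each position, a colon inside a time is consumed as part of the capture and never reaches the bare `:` alternative.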

Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.

Regex demo

  • \b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?):
    • \b - word boundary
    • ((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
      • (?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit or 2 and then a digit from 0 to 3
      • :[0-5][0-9] - : and then a number from 00 to 59
      • \s* - 0+ whitespaces
      • [pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
  • | - or
  • \b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot

Python fixed solution:

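import re
text = 'am pm  P.M.  10:56pm 10:43 a.m.'
# Group 1 captures am/pm inside a time string; the second alternative matches it elsewhere
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
# Put time strings back unchanged, replace every other match with a newline
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)
print(repr(result))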

Ruby solution:

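text = 'am pm  P.M.  10:56pm 10:43 a.m.'
# Group 1 captures am/pm inside a time string; the second alternative matches it elsewhere
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
# Put time strings back unchanged, replace every other match with a newline
result = text.gsub(rx) { $1 || "\n" }
p result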

Output:

"\n \n  \n  10:56pm 10:43 a.m."
Wiktor Stribiżew

@mrzasa has certainly found the problem.

But, taking a guess at your intent (replacing a non-time colon with ':\n'),
I think it could be done like this. It also trims a little whitespace.

(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)

PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n

Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n

Readable version

 (?i)
 (?<!
      \b [01] [0-9] 
 )
 (?<!
      \b [2] [0-3] 
 )
 (                             # (1 start)
      [^\S\r\n]* 
      :
 )                             # (1 end)
 [^\S\r\n]* 
 (?!
      [0-5] [0-9] 
      (?: [ap] \.? m \b \.? )?
 )
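For reference, a minimal Python sketch of applying the pattern above with the `\1\n` replacement. The function name `break_colons` and the sample strings are made up for illustration:

```python
import re

# The one-line pattern from this answer, unchanged.
rx = re.compile(
    r'(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])'
    r'([^\S\r\n]*:)[^\S\r\n]*'
    r'(?![0-5][0-9](?:[ap]\.?m\b\.?)?)'
)

def break_colons(s):
    """Replace each non-time colon (plus trailing blanks) with ':' and a newline."""
    return rx.sub(r'\1\n', s)

example = break_colons('Note: meet at 10:30 pm')
```

The horizontal whitespace after the colon is matched outside group 1, so it is dropped by the replacement, while the colon in `10:30` is protected by the first lookbehind.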