13

Is it possible to match a list of comma-separated decimal integers, where the integers in the list always increment by one?

These should match:

0,1,2,3
8,9,10,11
1999,2000,2001
99,100,101

These should not match (in their entirety - the last two have matching subsequences):

42
3,2,1
1,2,4
10,11,13
NikiC
  • 100,734
  • 37
  • 191
  • 225

1 Answers1

24

Yes, this is possible when using a regex engine that supports backreferences and conditions.

First, the list of consecutive numbers can be decomposed into a list where each pair of numbers are consecutive:

(?=(?&cons))\d+
(?:,(?=(?&cons))\d+)*
,\d+

Here (?=(?&cons)) is a placeholder for a predicate that ensures that two numbers are consecutive. This predicate might look as follows:

(?<cons>\b(?:
    (?<x>\d*)
    (?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
               |(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
    (?:9(?= 9*,\g{x}\d (?<y>\g{y}?+ 0)))*
    ,\g{x}
    (?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
    (?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
    (?(y)\g{y})
    # handle the 999 => 1000 case separately
  | (?:9(?= 9*,1 (?<z>\g{z}?+ 0)))+
    ,1\g{z}
)\b)

For a brief explanation, the second case handling 999,1000 type pairs is easier to understand -- there is a very detailed description of how it works in this answer concerned with matching a^n b^n. The connection between the two is that in this case we need to match 9^n ,1 0^n.

The first case is slightly more complicated. The largest part of it handles the simple case of incrementing a decimal digit, which is relatively verbose due to the number of said digits:

(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
           |(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))

(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)

The first block will capture whether the digit is N into group aN and the second block will then uses conditionals to check which of these groups was used. If group aN is non-empty, the next digit should be N+1.

The remainder of the first case handles cases like 1999,2000. This again falls into the pattern N 9^n, N+1 0^n, so this is a combination of the method for matching a^n b^n and incrementing a decimal digit. The simple case of 1,2 is handled as the limiting case where n=0.

Complete regex: https://regex101.com/r/zG4zV0/1


Alternatively the (?&cons) predicate can be implemented slightly more directly if recursive subpattern references are supported:

(?<cons>\b(?:
    (?<x>\d*)
    (?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
               |(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
    (?<y>
        ,\g{x}
        (?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
        (?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
      | 9 (?&y) 0
    )
    # handle the 999 => 1000 case separately
  | (?<z> 9,10 | 9(?&z)0 )
)\b)

In this case the two grammars 9^n ,1 0^n, n>=1 and prefix N 9^n , prefix N+1 0^n, n>=0 are pretty much just written out explicitly.

Complete alternative regex: https://regex101.com/r/zG4zV0/3

Community
  • 1
  • 1
NikiC
  • 100,734
  • 37
  • 191
  • 225
  • Nice post. I've got a general question regarding _nested back reference_ This expression in your regex for example `(?\g{y}?+0)`. I use the _Boost regex_ engine which is mostly PCRE and Perl compatible. Sometimes I have to use workarounds for certain situations. The one for this is for example `(?(?&_y)?+0)(?(DEFINE)(?<_y>\g{y}))` Drives me nuts sometimes with `undefined backref` message, that if a workaround can be made, why not support it outright. Was wondering if you know a better workaround that doesn't require a function call? (_workaround https://regex101.com/r/zG4zV0/2_). –  Sep 04 '16 at 23:47
  • @sln Sorry, not familiar with the boost regex engine. If I understand correctly, the problem is that boost does not support backrefs to groups that haven't been (fully) matched yet, correct? In that case your workaround seems reasonable. I've also added an alternative solution based on recursive subpattern references, maybe that works out of the box... – NikiC Sep 05 '16 at 11:23
  • It's compile time, during parsing: `groups that haven't been (fully) parsed yet`. So the reference is to it's parent capture group. I solved the problem by allowing backrefs to this mark at the time the group is opened, but before inner states are added via recursive parse. Formerly, it only allowed backrefs to this mark when the closing `)` was parsed. It's a simple bitmask `m_backrefs |= 1<<(markid-1)`, if the closing paren is not found an error is generated and the whole thing unwinds anyway. Works as expected. I just add this to the other mods I've made. –  Sep 06 '16 at 14:47
  • 1
    Congrats. You succeeded to do maths with a string manipulator system. – Thomas Ayoub Sep 07 '16 at 14:20
  • @ThomasAyoub - It's not actually math per se. If the OP were to use Perl, actual math could be done in a regex via code assertion's `(?{})` in a fairly simple way. –  Sep 08 '16 at 20:11
  • @sln PCRE has a feature called _callout_ allowing for arbitrary assertions. They invoke a function when `(?CX)` (where X is a number between 1 and 255) is encountered in a pattern and pass it the position in the string… the called function can then do different assertions depending on the also passed number X (from `(?CX)`)… which is basically equivalent to that. It's just not "inline" in the pattern itself. – bwoebi Sep 09 '16 at 17:26
  • @bwoebi - Thanks. I've heard of the callout. In Perl, the inline assertion `(?{code})` can be used in conditionals and the `(??{ code })` construct can _inject_ regex expressions at that point with it's return string. A synergistic approach. How they do it is beyond me, probably heavy use of **eval**. –  Sep 09 '16 at 17:43