Shorten a RegEx

Question

In Java RegEx I have the following:

(1abc\\d{2})|(2abc\\d{3})|(3abc\\d{4})

I would like to extract the 'abc\d' out of the RegEx and replace the RegEx with something like:

(1|2|3)abc\\d({2}|{3}|{4})

The problem is that 1 belongs to {2}, 2 belongs to {3}, and 3 belongs to {4}. So 1abc12 is a good match, but 1abc123 is a bad one.

I have recently learned RegEx and I feel like I'm missing some knowledge about RegEx to make this possible. Is it even possible?

As a possible alternative, relax the matching criteria similar to your second expression (but simpler) and use java logic on capturing groups 1 and 2 to filter the results as good or bad. — ccoakley, Dec 09 '11 at 17:13

score 1 · Accepted Answer · edited May 23 '17 at 12:27

What you describe is not possible with regular expressions. In general, a later part of the expression cannot depend on the matching result of an earlier part of the expression. For example, you can't write a regex that matches balanced parentheses or matching HTML tags.

Some implementations provide extensions which give exceptions to this (irregular expressions), but I don't think they apply here.

answered Dec 09 '11 at 17:13

Jay Conrod

Dave Webb's answer is also good, but throwing in the term 'irregular expression' pointed me more in the direction of a good explanation of WHY what I'm trying to do isn't possible. – Rafiek Dec 12 '12 at 11:23

score 1 · Answer 2 · answered Dec 09 '11 at 17:16

You can use back references via \n in Regular Expressions to refer to previously matched groups, but these only match the same string again; they can't change the rules of the pattern.

For example, (1|2|3)abc\1 would match 1abc1 and 2abc2 but not 1abc2, i.e. the \1 matches whatever the first group captured.
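The back-reference behaviour described above can be checked with a small sketch. The class name BackreferenceDemo and the helper sameDigit are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Pattern;

public class BackreferenceDemo {
    // \1 must repeat exactly the text that group 1 captured.
    private static final Pattern SAME_DIGIT = Pattern.compile("(1|2|3)abc\\1");

    // Hypothetical helper: true if the whole input matches the pattern.
    public static boolean sameDigit(String input) {
        return SAME_DIGIT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameDigit("1abc1")); // true
        System.out.println(sameDigit("2abc2")); // true
        System.out.println(sameDigit("1abc2")); // false
    }
}
```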

Ideally, we would want something like (1|2|3)abc\d{\1 + 1}, but Java doesn't support code or expressions within its regular expressions.

So unfortunately what you want isn't possible, or rather your first expression is probably as good as it's going to get.

score 1 · Answer 3 · 2011-12-10T22:46:32.090

It could be done in a pseudo-conditional way, but the cure might be worse than the sickness.

The only way I would use something like this (below) is if the 'text' (abc in this case) were large enough that factoring it out would yield time gains over repeating it in each alternation as it exists now. An example of such text might be 'abc[^\d]+432xyz', or anything that has open-ended quantifiers or that causes heavy backtracking.

This works in Java:

"^(?:1()|2()|3())abc(?:(?=\\1)\\d{2}|(?=\\2)\\d{3}|(?=\\3)\\d{4})$"

(expanded)

^       # Begin, all capture buffers are undefined and empty
  (?:
      1()     # If '1' found, set capture buffer 1 to defined (but empty)
    | 2()     # If '2' found, set capture buffer 2 to defined (but empty)
    | 3()     # If '3' found, set capture buffer 3 to defined (but empty)
  )
  abc      # The text factored out
  (?:
       # The below could also be  \1\d{2}|\2\d{3}|\3\d{4} as well

      (?=\1)\d{2}    #     Assertion: is capt buffer 1 defined?, get next two digits
    | (?=\2)\d{3}    # or, Assertion: is capt buffer 2 defined?, get next three digits
    | (?=\3)\d{4}    # or, Assertion: is capt buffer 3 defined?, get next four digits
  )
$      # End
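As a rough check of the expression above, here is a minimal sketch that runs it against a few inputs. The class name EmptyGroupDemo and the method isValid are illustrative assumptions. It relies on java.util.regex failing a back-reference to a group that did not take part in the match.

```java
import java.util.regex.Pattern;

public class EmptyGroupDemo {
    // Each empty group () acts as a flag recording which prefix digit matched.
    private static final Pattern PATTERN = Pattern.compile(
        "^(?:1()|2()|3())abc(?:(?=\\1)\\d{2}|(?=\\2)\\d{3}|(?=\\3)\\d{4})$");

    // Hypothetical helper: true if the digit count fits the leading digit.
    public static boolean isValid(String input) {
        return PATTERN.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1abc12"));   // true
        System.out.println(isValid("1abc123"));  // false
        System.out.println(isValid("2abc123"));  // true
        System.out.println(isValid("3abc1234")); // true
    }
}
```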

Also, as someone mentioned, you could do a general capture, then post-process the result to decide if it is valid.

Something like this: ^(1|2|3)abc(\d{2,4})$. Then switch on capture buffer 1, with each case checking the length of capture buffer 2.
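A minimal sketch of that capture-then-check approach, assuming Java 7 or later for switching on strings. The names PostProcessDemo and isValid are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostProcessDemo {
    // Loose pattern: any of the three prefixes, followed by 2 to 4 digits.
    private static final Pattern LOOSE = Pattern.compile("^(1|2|3)abc(\\d{2,4})$");

    // Hypothetical helper: accept only if the digit count fits the prefix.
    public static boolean isValid(String input) {
        Matcher m = LOOSE.matcher(input);
        if (!m.matches()) {
            return false;
        }
        int digits = m.group(2).length();
        switch (m.group(1)) {
            case "1": return digits == 2;
            case "2": return digits == 3;
            case "3": return digits == 4;
            default:  return false; // unreachable given the pattern
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("1abc12"));  // true
        System.out.println(isValid("1abc123")); // false
    }
}
```

The regex stays short and readable here, at the cost of moving part of the validation rule into Java code.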

You don't actually have to use lookaheads; `\1\d{2}` works just as well. — Alan Moore, Dec 10 '11 at 01:18
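The variant without lookaheads from this comment can be tried the same way. This is only a sketch; the class name PlainBackrefDemo and the method isValid are made up for illustration.

```java
import java.util.regex.Pattern;

public class PlainBackrefDemo {
    // A back-reference to an empty group matches the empty string if that
    // group took part in the match, and fails otherwise.
    private static final Pattern PATTERN = Pattern.compile(
        "^(?:1()|2()|3())abc(?:\\1\\d{2}|\\2\\d{3}|\\3\\d{4})$");

    public static boolean isValid(String input) {
        return PATTERN.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("2abc123")); // true
        System.out.println(isValid("2abc12"));  // false
    }
}
```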

score 0 · Answer 4 · answered Dec 09 '11 at 17:09

Since the numbers 1, 2, 3 are somewhat linked to your regex groups {2}, {3} and {4} respectively, I think there is no way to extract the common subexpression.

Olivier Croisier

score 0 · Answer 5 · answered Dec 09 '11 at 17:58

Not a perfect solution, but you could use string functions to extract the first digit (or a regular expression if the format is not guaranteed to be an appropriate pattern). Then with the first digit, add one, and use it in a very simple regex.
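One possible reading of this suggestion, sketched under the assumption that the first character is always the prefix digit and that the required digit count is that digit plus one. DynamicPatternDemo and isValid are hypothetical names.

```java
import java.util.regex.Pattern;

public class DynamicPatternDemo {
    // Hypothetical helper: build a simple regex from the leading digit.
    public static boolean isValid(String input) {
        if (input.isEmpty()) {
            return false;
        }
        char first = input.charAt(0);
        if (first < '1' || first > '3') {
            return false; // only the prefixes 1, 2 and 3 are allowed
        }
        int digitCount = (first - '0') + 1; // 1 -> 2, 2 -> 3, 3 -> 4
        String regex = first + "abc\\d{" + digitCount + "}";
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(isValid("1abc12"));   // true
        System.out.println(isValid("1abc123"));  // false
        System.out.println(isValid("3abc1234")); // true
    }
}
```

Compiling a pattern on every call is wasteful if this runs often. With only three possible prefixes, the three patterns could be compiled once and cached.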

Billy Moon
