I have two regexes. Both match American date formats. Here they are (I have highlighted the group I am asking about):

^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-**(19|20\d\d)**(.*?)$

^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-**((19|20)\d\d)**(.*?)$

Both match:

asasa12-12-1993.txt
asassa12-12-2010.txt

In the book, the author puts 19|20 into its own group. Why?

  • I think the `**` is not valid regex notation. Note that you can also write `([0-3])?`, using a character class instead of the alternation `(0|1|2|3)?` – The fourth bird Sep 07 '20 at 07:22

2 Answers

My best guess is it's easier for humans to parse.

The first, (19|20\d\d), doesn't make it obvious whether the alternation is "19 or 20\d\d" or "19 or 20, then \d\d", whereas with ((19|20)\d\d) it's obvious that it's "19 or 20, then \d\d".

AKX

AKX is almost right, but there is more to it than that.

19|20\d\d will match either 19 on its own OR 20 followed by 2 digits.

It will not match 19 together with the 2 digits that follow it.

Have a look here: https://regex101.com/r/lvYGUb/3

You'll see that 2010 is captured as a single match, whereas 19 is matched alone, without the 93. As a consequence, the 93 ends up in the same group as .txt, which is probably not what you want.
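If you'd rather check this locally than on regex101, here is a small Python sketch that runs both patterns from the question on the two sample file names. The names `LOOSE`, `GROUPED` and `year_and_suffix` are mine, purely for illustration, and I'm assuming Python's `re` module; other backtracking engines should behave the same way for this pattern.

```python
import re

# Pattern from the question where the alternation is NOT bounded: 19 | 20\d\d
LOOSE = re.compile(r"^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-(19|20\d\d)(.*?)$")
# Pattern from the book where the alternation is bounded: (19|20) then \d\d
GROUPED = re.compile(r"^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-((19|20)\d\d)(.*?)$")


def year_and_suffix(pattern, name):
    """Return (year group, trailing group) for a file name, or None if no match."""
    m = pattern.match(name)
    if m is None:
        return None
    # Group 6 is the year group in both patterns; the trailing (.*?) is the last group.
    return m.group(6), m.groups()[-1]


for name in ("asasa12-12-1993.txt", "asassa12-12-2010.txt"):
    print(name, year_and_suffix(LOOSE, name), year_and_suffix(GROUPED, name))
# asasa12-12-1993.txt ('19', '93.txt') ('1993', '.txt')
# asassa12-12-2010.txt ('2010', '.txt') ('2010', '.txt')
```

Both patterns "match" both names, which is why the difference is easy to miss; only the captured groups show that the 1993 case is split in the wrong place.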

Similarly, consider this data file:

20 euros
20 €

Let's say you want a regex that matches each of the two lines in full.

\d+ euros|€ won't work, because it means either a number followed by the word euros OR just the € sign alone, with no number in front of it.

But

\d+ (euros|€) will work, because the alternation now only covers the currency part.

So the purpose of the parentheses here is not to capture the group; they are just there to put a boundary on the OR operator.
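The euros example can be checked the same way. This is only a sketch in Python; `unbounded`, `bounded` and `fully_matched` are illustrative names of my own, and I use `fullmatch` to stand in for "match 100% of the line".

```python
import re

lines = ["20 euros", "20 €"]

# Alternation spans the whole pattern: "\d+ euros" OR "€"
unbounded = re.compile(r"\d+ euros|€")
# Alternation is limited to the currency part by the parentheses
bounded = re.compile(r"\d+ (euros|€)")


def fully_matched(pattern, lines):
    """For each line, report whether the pattern matches the entire line."""
    return [pattern.fullmatch(line) is not None for line in lines]


print(fully_matched(unbounded, lines))  # [True, False]
print(fully_matched(bounded, lines))    # [True, True]

# With the unbounded pattern, a plain search on the second line only finds the sign.
print(unbounded.search("20 €").group(0))  # €
```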

If you don't want those parentheses to capture the group, you can add ?: to make it a non-capturing group, like so:

^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-((?:19|20)\d\d)(.*?)$
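
A quick way to see what ?: changes, again as a Python sketch with names I made up (`CAPTURING`, `NON_CAPTURING`): both patterns match the same text, but the non-capturing version has one group fewer, so the trailing part stays at group 7 instead of shifting to group 8.

```python
import re

CAPTURING = re.compile(r"^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-((19|20)\d\d)(.*?)$")
NON_CAPTURING = re.compile(r"^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-((?:19|20)\d\d)(.*?)$")

name = "asasa12-12-1993.txt"

# Number of capturing groups in each compiled pattern
print(CAPTURING.groups, NON_CAPTURING.groups)  # 8 7

# The century "19" shows up as its own group only in the capturing version.
print(CAPTURING.match(name).groups())
# ('asasa', '12', '1', '12', '1', '1993', '19', '.txt')
print(NON_CAPTURING.match(name).groups())
# ('asasa', '12', '1', '12', '1', '1993', '.txt')
```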
Vincent