I am not looking for a regex to match phone numbers; that is simply my use case. I want to know why my regex isn't including an optional non-capturing group in the capture.

To illustrate my specific use case, a bit of an introduction: I am trying to match phone numbers, and my regex works except when an extension is present.

My regex (a bit long, but comprehensive):

((?:\+{0,2}\d{1,3})?[-.()\/* ]*?\d{3}[-.()\/* ]*?\d{3}[-.()\/* ]*?\d{4}[-.()\/* ]*?(?:(?:x|ext)[:]?[ ]*\d+)?)

A shortened version to illustrate my issue:

(\d{4}[-.()\/* ]*?(?:(?:x|ext)[:]?[ ]*\d+)?)

Where:

(...) is my capture group

\d{4} four digits

[-.()\/* ]*? various separators 0-infinite times (non-greedy)

(?:...) non-capture group

x|ext extension identifier

[:]? ":" 0-1 time

[ ]* " " 0-infinite times

\d+ digit 1-infinite times

(?:...)? non-capture group 0-1 time

So 1234 ext 567 should match in full, but only 1234 is matched.
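To reproduce this outside regex101, here is a minimal sketch in Python. The choice of Python's `re` module is an assumption (the question does not depend on a particular engine), and the names `LAZY_OPTIONAL` and `reproduced` are made up for illustration. The pattern is the shortened regex from above, written with `re.VERBOSE` so each component from the list sits on its own line.

```python
import re

# The shortened regex, one component per line.
# In VERBOSE mode whitespace is ignored except inside a character class,
# so the literal spaces in [-.()\/* ] and [ ] are preserved.
LAZY_OPTIONAL = re.compile(r"""
    (                   # capture group
      \d{4}             # four digits
      [-.()\/* ]*?      # separators, zero or more times, lazy
      (?:               # non-capturing group
        (?:x|ext)       # extension identifier
        [:]?            # ":" zero or one time
        [ ]*            # " " zero or more times
        \d+             # one or more digits
      )?                # the whole group is optional
    )
""", re.VERBOSE)

reproduced = LAZY_OPTIONAL.search("1234 ext 567").group(1)
print(repr(reproduced))  # '1234' (the extension is left out)
```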

Regex101 link: regex101.com/r/NRQhTl/1

If I remove the trailing ? so that the group is no longer optional, it works just fine:

(\d{4}[-.()\/* ]*?(?:(?:x|ext)[:]?[ ]*\d+))

It seems like the ? is making the group lazy, but without it the regex no longer matches numbers that have no extension.
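The trade-off can be checked with a short sketch, again assuming Python's `re` module; the variable names are illustrative only. With the group required, the extension is captured, but a bare number no longer matches at all.

```python
import re

# Same shortened regex, but the extension group is required (no trailing "?").
required = re.compile(r"(\d{4}[-.()\/* ]*?(?:(?:x|ext)[:]?[ ]*\d+))")

with_ext = required.search("1234 ext 567")
without_ext = required.search("1234")

print(with_ext.group(1))  # 1234 ext 567
print(without_ext)        # None
```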

Any help or insights would be greatly appreciated.

  • Where do scala and lazy-evaluation come into play? As far as I can tell this is strictly a regex question. – Ethan Jul 03 '18 at 19:40
  • `*?` is a *lazy* regex. If I understand it correctly, you want just `*` after `[...]`, not a `*?`. The tag `lazy-evaluation` is incorrect, it has nothing to do with it. – Andrey Tyukin Jul 03 '18 at 19:41
  • @emsimpson92 I don't think this is a duplicate. While I am trying to match phone numbers, my question isn't about finding a regex to match phone numbers. Rather, it's about why the non-capturing extension group is acting lazy. It's far more specific. – Randomness Slayer Jul 03 '18 at 19:41
  • @AndreyTyukin you are correct, lazy regex is what I meant – Randomness Slayer Jul 03 '18 at 19:43
  • It's not the extension group acting lazily. It's the `[-.()\/* ]*?` that matches zero characters and then gives up, because it's "good enough". – Andrey Tyukin Jul 03 '18 at 19:43
  • @AndreyTyukin that makes sense. That hadn't occurred to me. Thanks! – Randomness Slayer Jul 03 '18 at 19:47

1 Answer


If you replace the lazy *? quantifier after the separator symbols with a greedy *, then it seems to work just fine:

(\d{4}[-.()\/* ]*(?:(?:x|ext)[:]?[ ]*\d+)?)

Demo: regex101.

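Your regex has the shape foo[bar]*?(?:extension)?. It stops matching immediately after foo because the lazy *? quantifier tries zero repetitions of [bar] first. The optional extension group then cannot match at that position (the next character is still a separator), but since the group is optional it simply matches nothing, and the overall match succeeds. The engine therefore never has a reason to backtrack and let *? consume the separators.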
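A side-by-side check of the two quantifiers, as a minimal sketch that assumes Python's `re` module (the variable names are only for illustration). The only difference between the two patterns is `*?` versus `*` after the separator class.

```python
import re

sample = "1234 ext 567"

# Lazy separators: zero separators is "good enough", so the optional group is skipped.
lazy = re.compile(r"(\d{4}[-.()\/* ]*?(?:(?:x|ext)[:]?[ ]*\d+)?)")
# Greedy separators: the space is consumed first, so the group can match "ext 567".
greedy = re.compile(r"(\d{4}[-.()\/* ]*(?:(?:x|ext)[:]?[ ]*\d+)?)")

lazy_result = lazy.search(sample).group(1)
greedy_result = greedy.search(sample).group(1)
no_ext_result = greedy.search("1234").group(1)

print(repr(lazy_result))    # '1234'
print(repr(greedy_result))  # '1234 ext 567'
print(repr(no_ext_result))  # '1234'
```

The greedy version still matches a number without an extension, because the group stays optional.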

You might also consider moving the [-.()\/* ]* part into the (?: ... )? group, because otherwise the match will include trailing separators, such as periods, that aren't followed by a proper extension.
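One possible reading of that suggestion, sketched in Python (the `re` module, the pattern `sep_inside`, and the sample strings are assumptions for illustration, not something from the question). With the separators inside the optional group, they are only consumed when an extension actually follows.

```python
import re

# Greedy separators outside the optional group: trailing separators are captured.
sep_outside = re.compile(r"(\d{4}[-.()\/* ]*(?:(?:x|ext)[:]?[ ]*\d+)?)")
# Separators moved inside the optional group: captured only along with an extension.
sep_inside = re.compile(r"(\d{4}(?:[-.()\/* ]*(?:x|ext)[:]?[ ]*\d+)?)")

text = "1234. See you"

outside_result = sep_outside.search(text).group(1)
inside_result = sep_inside.search(text).group(1)
inside_with_ext = sep_inside.search("1234 ext 567").group(1)

print(repr(outside_result))   # '1234. '
print(repr(inside_result))    # '1234'
print(repr(inside_with_ext))  # '1234 ext 567'
```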

I'm not sure what you intended with the () there, to be honest: is it really supposed to match 1234) ext 5678?

Andrey Tyukin
  • The parens are primarily for the area code and country code. For the shortened example they are extraneous. I build the regex from components, and that piece is my generic "separator". It was lazy exactly for that reason, so I may need to be less generic. – Randomness Slayer Jul 03 '18 at 20:03