
I have a file that contains phone numbers of the following formats:

(xxx) xxx.xxxx
(xxx).xxx.xxxx
(xxx) xxx-xxxx
(xxx)-xxx-xxxx
xxx.xxx.xxxx
xxx-xxx-xxxx
xxx xxx-xxxx
xxx xxx.xxxx

I must parse the file for phone numbers in those and ONLY those formats, and write them to a separate file. I'm using Perl, and so far I have what I think is a valid regex for two of these formats:

my $phone_regex = qr/^(\d{3}\-)?(\(\d{3}\))?\d{3}\-\d{4}$/;

But I'm not sure if this is correct, or how to do the rest all in one regex. Thank you!

  • For difficult regexes it might be helpful to define sub-regexes, e.g. `$three_digits = qr/\d{3}/;` and then combine them to each case you list as another sub-regex and then combine those 8 regexes to the final regex. – Stefan Becker Feb 03 '19 at 19:04
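
A minimal sketch of that suggestion, assuming ASCII digits only. The variable names (`$d3`, `$d4`, `$area`, `@formats`) are illustrative and not from the original post; each entry of `@formats` corresponds to one of the eight formats in the question.

```perl
use strict;
use warnings;

# Building blocks (ASCII digits only).
my $d3   = qr/[0-9]{3}/;
my $d4   = qr/[0-9]{4}/;
my $area = qr/\($d3\)/;      # "(xxx)"

# One sub-regex per format listed in the question.
my @formats = (
    qr/$area $d3\.$d4/,      # (xxx) xxx.xxxx
    qr/$area\.$d3\.$d4/,     # (xxx).xxx.xxxx
    qr/$area $d3-$d4/,       # (xxx) xxx-xxxx
    qr/$area-$d3-$d4/,       # (xxx)-xxx-xxxx
    qr/$d3\.$d3\.$d4/,       # xxx.xxx.xxxx
    qr/$d3-$d3-$d4/,         # xxx-xxx-xxxx
    qr/$d3 $d3-$d4/,         # xxx xxx-xxxx
    qr/$d3 $d3\.$d4/,        # xxx xxx.xxxx
);

# Combine the eight cases into one anchored alternation.
my $alternation = join '|', @formats;
my $phone_regex = qr/\A(?:$alternation)\z/;
```

This is verbose, but each accepted format is visible on its own line, which makes it easy to confirm that nothing else slips through.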

3 Answers


You don't need to escape the hyphen outside a character class, and your pattern doesn't allow for the . and space separators. The regex you are trying to create is this:

^\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}$

Explanation:

  • ^ - Start of string
  • \(? - Optional starting parenthesis (
  • \d{3} - Followed by three digits
  • \)? - Optional closing parenthesis )
  • [ .-] - A single character either a space or . or -
  • \d{3} - Followed by three digits
  • [ .-] - Again a single character either a space or . or -
  • \d{4} - Followed by four digits
  • $ - End of string
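
A quick way to see what this pattern accepts is to run it against a few samples. The sketch below is illustrative only: the sample strings and the name `$re` are assumptions, not part of the answer. It suggests the pattern is somewhat more permissive than the eight formats in the question, since the parentheses are optional independently of each other and both separators may be any of the three characters.

```perl
use strict;
use warnings;

my $re = qr/^\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}$/;

my @samples = (
    '(123) 456.7890',   # listed format
    '123-456-7890',     # listed format
    '(123 456.7890',    # unbalanced parenthesis, not a listed format
    '123.456 7890',     # space as second separator, not a listed format
);

# Report whether each sample matches the pattern.
for my $s (@samples) {
    printf "%-16s %s\n", $s, ($s =~ $re ? 'match' : 'no match');
}
```

All four samples report a match, so if ONLY the listed formats may pass, a stricter pattern or an extra check is needed.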


Pushpesh Kumar Rajwanshi

Here you go

\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}

See a demo on regex101.com.


Broken down, this is:
\(?   # "(", optional
\d{3} # three digits
\)?   # ")", optional
[-. ] # one of "-", "." or " "
\d{3} # three digits
[-. ] # same as above
\d{4} # four digits

If you want, you can add a word boundary (\b) on the right side; some potential matches may be filtered out then.
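
To illustrate the effect of the trailing word boundary, here is a small sketch that scans a line of free text with and without `\b`. The sample text and the variable names (`$loose`, `$bounded`) are made up for the example.

```perl
use strict;
use warnings;

my $loose   = qr/\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}/;
my $bounded = qr/\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b/;

# The second "number" has five trailing digits instead of four.
my $text = 'Call (555) 123-4567 or 555-123-45678 today.';

# In list context, a /g match returns every captured hit.
my @loose_hits   = $text =~ /($loose)/g;
my @bounded_hits = $text =~ /($bounded)/g;

print 'without \b: ', join(', ', @loose_hits),   "\n";
print 'with \b:    ', join(', ', @bounded_hits), "\n";
```

Without the boundary, the first twelve characters of `555-123-45678` are reported as a phone number; with it, only `(555) 123-4567` is found.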

Jan

Your current regex allows too much: it accepts xxx-(xxx) at the beginning. It also doesn't handle any of the . or space-separated cases. You want exactly three sets of digits, with optional parentheses around the first set (an alternation handles this), and character classes to specify which separators are allowed.

Additionally, don't use \d, as it will match any Unicode digit. Since you likely only want to allow ASCII digits, use the character class [0-9] (there are other options, but this is the simplest).

Finally, $ allows a newline at the end of the string, so use \z instead, which does not. If you are reading these from a file, make sure you chomp the lines so they do not contain trailing newlines.
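
Both points can be checked with a short sketch. The test strings and variable names are illustrative; the `/a` modifier, available since Perl 5.14, is one of the other ways to restrict `\d` to ASCII.

```perl
use strict;
use warnings;

# Three ARABIC-INDIC digits (1, 2, 3), written as escapes to keep the source ASCII.
my $unicode_digits = "\x{0661}\x{0662}\x{0663}";

my $d_matches      = $unicode_digits =~ /^\d{3}$/    ? 1 : 0;  # \d is Unicode-aware
my $ascii_matches  = $unicode_digits =~ /^[0-9]{3}$/ ? 1 : 0;  # explicit ASCII class
my $a_flag_matches = $unicode_digits =~ /^\d{3}$/a   ? 1 : 0;  # /a restricts \d to ASCII

# A line that has not been chomped.
my $with_newline = "555-1234\n";
my $dollar_ok = $with_newline =~ /^[0-9]{3}-[0-9]{4}$/  ? 1 : 0;  # $ tolerates a final newline
my $z_ok      = $with_newline =~ /^[0-9]{3}-[0-9]{4}\z/ ? 1 : 0;  # \z does not

print "\\d:     $d_matches\n";       # 1
print "[0-9]:  $ascii_matches\n";    # 0
print "\\d /a:  $a_flag_matches\n";  # 0
print "\$:      $dollar_ok\n";       # 1
print "\\z:     $z_ok\n";            # 0
```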

This leaves us with:

qr/^(?:[0-9]{3}|\([0-9]{3}\))[-. ][0-9]{3}[-.][0-9]{4}\z/

If you want to ensure that the two separators are the same if the first is a . or -, it is easiest to do this in multiple regex checks (these can be more lenient since we already validated the general format):

if ($str =~ m/^[0-9()]+ /
    or $str =~ m/^[0-9()]+\.[0-9]{3}\./
    or $str =~ m/^[0-9()]+-[0-9]{3}-/) {
    # allowed
}
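
Putting the format check and the separator check together, a file filter might look like the sketch below. The helper names `is_listed_phone` and `filter_phones` are hypothetical, and in-memory filehandles stand in for the real input and output files so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if $str is in one of the eight listed formats.
sub is_listed_phone {
    my ($str) = @_;

    # General shape: optional parentheses around the first group, two separators.
    return 0 unless $str =~ /^(?:[0-9]{3}|\([0-9]{3}\))[-. ][0-9]{3}[-.][0-9]{4}\z/;

    # Separator consistency: a space may be followed by "." or "-",
    # but a leading "." or "-" must be repeated.
    return 1 if $str =~ /^[0-9()]+ /
             or $str =~ /^[0-9()]+\.[0-9]{3}\./
             or $str =~ /^[0-9()]+-[0-9]{3}-/;
    return 0;
}

# Hypothetical filter: copy matching lines from one filehandle to another.
sub filter_phones {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;    # strip the newline so \z can anchor at the true end
        print {$out} "$line\n" if is_listed_phone($line);
    }
    return;
}

# Usage with in-memory filehandles; swap in real file names as needed.
my $input  = "(555) 123-4567\n555.123-4567\n555 123.4567\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot read input: $!";
open my $out, '>', \$output or die "Cannot write output: $!";
filter_phones($in, $out);
close $out;
print $output;
```

Here the middle line, `555.123-4567`, is dropped because its two separators differ.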
Grinnz
  • Also, when the first delimiter is . or -, needs to restrict the second delimiter to that – ysth Feb 03 '19 at 19:10
  • @ysth I fixed the first by using an alternation instead, and added an option for checking the second (while it's possible to do in one regex, it would be unnecessarily complex) – Grinnz Feb 03 '19 at 19:17
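
For reference, a single-regex version of the separator rule can be sketched with a backreference. This is one possible formulation under the `/x` modifier, not code from the thread, and the name `$phone_regex` is reused purely for illustration.

```perl
use strict;
use warnings;

my $phone_regex = qr/
    \A
    (?: [0-9]{3} | \( [0-9]{3} \) )   # first group, optionally parenthesized
    (?:
        [ ]     [0-9]{3} [-.]         # space first: second may be "." or "-"
      | ([-.])  [0-9]{3} \1           # "." or "-" first: second must repeat it
    )
    [0-9]{4}
    \z
/x;

for my $candidate ('(555).123.4567', '555 123-4567', '555.123-4567') {
    printf "%-16s %s\n", $candidate,
        ($candidate =~ $phone_regex ? 'allowed' : 'rejected');
}
```

The first two candidates are allowed and the third is rejected. Note that `\1` refers to the first capture group, so the numbering would need adjusting if this pattern were embedded in a larger regex that has earlier captures.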