I use the following regular expression to find a phone number in a string:

([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})

It works great on numbers like:

555-555-5555 (555)555-5555 (555) 555-5555

However, if there's an extra space inside the string, it does not find the phone number:

555 -555-5555 (555)555- 5555 (555) 555 -5555

Can it be modified to allow for a space or two? My input comes from OCR and not user input so I can't require formatted input.

Thanks.

santa
  • If you add a literal space followed by a star (match 0-infinity times) before both `[-. ]` it should work fine. – h2ooooooo Nov 01 '17 at 18:42
  • *"My input comes from OCR"* - Side note: That could pose problems, where certain characters may not get interpreted correctly. Just be careful with that. – Funk Forty Niner Nov 01 '17 at 18:42
  • `{0,2}` would be zero to two of the preceding group/character. Perhaps https://regex101.com/r/0ZSkjW/1/ would do it, assuming `--` and/or `.-` as separators would also be valid. – chris85 Nov 01 '17 at 18:50

4 Answers

As per your examples, you could use

[(\d](?:(?!\h{2,})[-\d()\h])*\d

See a demo on regex101.com.


That is
[(\d]          # one of ( or 0-9
(?:            # a non-capturing group
    (?!\h{2,}) # make sure 2+ horizontal whitespace characters are not immediately ahead
    [-\d()\h]  # then match one of -, 0-9, (, ) or a horizontal whitespace character
)*             # zero or more times
\d             # the end must be a digit

It is a variation of the tempered greedy token.


In PHP this could be
<?php
$data = <<<DATA
555-555-5555   (555)555-5555    (555) 555-5555

However, if there's an extra space inside the string it does not find the phone. 555 -555-5555   (555)555- 5555    (555) 555 -5555
DATA;

$regex = '~[(\d](?:(?!\h{2,})[-\d()\h])*\d~';

preg_match_all($regex, $data, $matches);
print_r($matches);
?>

Which yields

Array
(
    [0] => Array
        (
            [0] => 555-555-5555
            [1] => (555)555-5555
            [2] => (555) 555-5555
            [3] => 555 -555-5555
            [4] => (555)555- 5555
            [5] => (555) 555 -5555
        )

)
Jan
  • Wow, I like that regex101.com! Will definitely use it more often. Script worked great but it also picked up dates, eg: 03-12-2016 can it be fine-tuned further? – santa Nov 01 '17 at 20:37
  • @santa: Of course, you could skip these: https://regex101.com/r/AEyAq1/3 – Jan Nov 02 '17 at 09:07
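
One possible way to skip such dates, sketched here as an assumption rather than as what the linked regex101 demo does, is to post-filter the candidates by digit count: a date like 03-12-2016 has 8 digits, while the phone numbers in the question have 10. The helper name `filterPhoneCandidates` and the sample text are hypothetical.

```php
<?php
// Hypothetical helper (not from the answer): keep only candidates
// that contain exactly 10 digits once all non-digits are stripped.
function filterPhoneCandidates(array $candidates): array
{
    return array_values(array_filter($candidates, function (string $candidate): bool {
        $digits = preg_replace('/\D/', '', $candidate);
        return strlen($digits) === 10;
    }));
}

// Assumed sample text: one date and two loosely formatted phone numbers.
$text = 'Scanned on 03-12-2016, call (555) 555 -5555 or 555 -555-5555.';

// Same pattern as in the answer above.
preg_match_all('~[(\d](?:(?!\h{2,})[-\d()\h])*\d~', $text, $matches);

$phones = filterPhoneCandidates($matches[0]);
print_r($phones);
```

With this input the pattern finds three candidates (the date and both phone numbers), and the filter keeps only the two phone numbers. A date and a phone number separated by a single space would still be merged into one candidate by the pattern, so this filter is only a partial safeguard.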

If I understood correctly, you want to use only a regex, so you can add \s* around each separator in the pattern, like this:

([0-9]{3})\)?\s*[-. ]?\s*([0-9]{3})\s*[-. ]?\s*([0-9]{4})\s*

This is based on the pattern in your question.

Here is a demo example.
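
A minimal sketch of how this pattern might be used in PHP, assuming the three capture groups are then reassembled into one canonical form. The sample input and the `normalizePhones` helper are illustrative assumptions, not part of the answer. Note that `\s` also matches line breaks, so digit groups on adjacent lines of OCR output could be joined into one match.

```php
<?php
// Hypothetical helper: find phone numbers with the pattern above and
// rewrite each one as 555-555-5555 using the three capture groups.
function normalizePhones(string $text): array
{
    $regex = '/([0-9]{3})\)?\s*[-. ]?\s*([0-9]{3})\s*[-. ]?\s*([0-9]{4})\s*/';
    preg_match_all($regex, $text, $sets, PREG_SET_ORDER);

    $phones = [];
    foreach ($sets as $set) {
        // $set[1], $set[2], $set[3] hold the 3-3-4 digit groups.
        $phones[] = $set[1] . '-' . $set[2] . '-' . $set[3];
    }
    return $phones;
}

// The three problem cases from the question, one per line.
$ocr = "555 -555-5555\n(555)555- 5555\n(555) 555 -5555";
print_r(normalizePhones($ocr));
```

Working from the capture groups rather than the full match sidesteps the trailing whitespace that the final `\s*` would otherwise include.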

Oscar Zarrus

I feel like you are asking for a very lenient / inclusive pattern.

This one is pretty forgiving: /\(?\d{3}\)? {0,2}[-.]? {0,2}\d{3} {0,2}[-.]? {0,2}\d{4}/

Pattern Demo Link

It will match all of these variants (...and more):

555-555-5555
(555)555-5555
(555) 555-5555
555 -555-5555
(555)555- 5555
(555) 555 -5555
555.555-5555
555.555.5555
5555555555
555-555.5555
(555)5555555
(555).555.5555
(555)-555-5555
(555555-5555
555)-555-5555
555555-5555
555 5555555
555 555 5555
555 - 555 - 5555
555555  .  5555

The pattern logic is in this order:

  • permit an optional (
  • require 3 digits
  • permit an optional )
  • permit zero, one, or two literal spaces
  • permit an optional hyphen or dot
  • permit zero, one, or two literal spaces
  • require 3 digits
  • permit zero, one, or two literal spaces
  • permit an optional hyphen or dot
  • permit zero, one, or two literal spaces
  • require 4 digits
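
The steps above can be sketched in PHP by assembling the pattern piece by piece, one array entry per bullet. The `$pieces` array, the anchored variant, and the sample strings are assumptions added for illustration; only the resulting pattern comes from the answer.

```php
<?php
// One entry per bullet of the list above, in the same order.
$pieces = [
    '\(?',     // optional (
    '\d{3}',   // 3 digits
    '\)?',     // optional )
    ' {0,2}',  // zero, one, or two literal spaces
    '[-.]?',   // optional hyphen or dot
    ' {0,2}',  // zero, one, or two literal spaces
    '\d{3}',   // 3 digits
    ' {0,2}',  // zero, one, or two literal spaces
    '[-.]?',   // optional hyphen or dot
    ' {0,2}',  // zero, one, or two literal spaces
    '\d{4}',   // 4 digits
];

$pattern  = '/' . implode('', $pieces) . '/';
// Anchored variant (an assumption) for testing whole strings.
$anchored = '/^' . implode('', $pieces) . '$/D';

$samples = ['555 -555-5555', '(555)555- 5555', '555 - 555 - 5555', '555   -555-5555'];
foreach ($samples as $sample) {
    echo $sample, ' => ', preg_match($anchored, $sample) ? 'match' : 'no match', PHP_EOL;
}
```

The last sample has three spaces before the hyphen and is rejected. Because the optional separator sits between two ` {0,2}` runs, up to four consecutive spaces are still accepted when no hyphen or dot is present.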
mickmackusa

To limit the number of added spaces, you can check the position of the first digit of the last group (you can also choose the last digit). Then all you have to do is to describe the different separators the way you want.

~[(\d](?:\b\d{3}\)(?=.{3,5}\W\b) {0,2}\d{3}|\B\d{2}(?=.{4,6}\W\b)(?:- ?| -? ?)\d{3})(?:- ?| -? ?)\d{4}\b~

demo

The same pattern in a more readable form:

~
[(\d]  # first-character discrimination technique (avoids the cost of an alternation
       # at the start of the pattern)
(?: # with brackets
    \b \d{3} \)
    (?= .{3,5} \W \b )
    \g<spb> \d{3}
  | # without brackets
    \B \d{2} # you can also replace \B with (?<=\b\d) to check the word-boundary
    (?= .{4,6} \W \b )
    \g<sp> \d{3}
)
\g<sp> \d{4} \b

# subpattern definitions:
(?<spb> [ ]{0,2} ){0}             # separator after bracket
(?<sp> - [ ]? | [ ] -? [ ]? ){0}  # other separators
~x

demo

Feel free to change - to [.-] or to define your own allowed separators. In that case, don't forget to also adjust the quantifiers in the lookaheads. Also, if you want to allow the second separator to be empty, check the boundary after the last digit instead of the boundary before the first digit of the last group.
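
A short sketch that runs the single-line pattern above against the samples from the question plus one date. The sample text is an assumption for illustration. Because the lookaheads tie the position of the last group to the start of the number, the date should not be reported.

```php
<?php
// Single-line pattern copied from the answer above.
$regex = '~[(\d](?:\b\d{3}\)(?=.{3,5}\W\b) {0,2}\d{3}|\B\d{2}(?=.{4,6}\W\b)(?:- ?| -? ?)\d{3})(?:- ?| -? ?)\d{4}\b~';

// Assumed input: the three problem cases from the question and a date,
// one per line.
$text = "555 -555-5555\n(555)555- 5555\n(555) 555 -5555\n03-12-2016";

preg_match_all($regex, $text, $found);
print_r($found[0]);
```

With this input the three phone numbers are matched in full, including the opening bracket where present, and the date is left out.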

Casimir et Hippolyte