2

I'm looking for a regex to use in Java (java.util.regex.Pattern) that will match a generalised form of a telephone number. I've specified this as being:

a sequence of at least 8 non-letter characters with at least 8 characters being digits.

An example of a string literal that should match would be:

"Tel: (011) 1234-1234 blah blah blah"

however the following string literal would not match:

"Fot 3 ..... a 3 blah blah blah"

I've got as far as matching a sequence of at least 8 non-letter characters

Pattern.compile("[^\\p{L}]{8,}");

How can I add an "and" (a conjunctive restriction) onto that regex, specifying [\d]{8,}?

I saw this post on Stack Overflow:

Regular Expressions: Is there an AND operator?

It is about "anding" regular expressions, but I can't seem to get it to work.

Any help or suggestions are very welcome.

Simon

Simon B
  • A regex to match a phone number is very tricky. You're much better off writing a scanner/parser to do this. You will get much better coverage and fewer false positives. – Richard H Mar 11 '11 at 16:14
  • If you have a problem that can be resolved by the use of a regular expression, you now have two problems :) – DaveH Mar 11 '11 at 16:16

4 Answers

3

If you are searching for phone numbers in unstructured documents, i.e. where the phone numbers could be expressed in any number of ways (with or without international prefixes, brackets around area codes, dashes, a variable number of digits, randomly split with white space etc.), and where you might well get lots of numbers that naively look like phone numbers but aren't (e.g. on the web), forget using a regex, seriously.

You are much better off writing your own parser. Basically this steps through your text one character at a time, and you can add any rules you like to it. This approach also makes it much easier to match against actual real phone numbers (e.g. valid international or area codes, or other rules local or national exchanges may have) and so cut down on false positives. I know from doing this myself, matching UK numbers across over a million business websites: a general regex for 10 or 11 digits plus some other basic rules matches an unbelievable number of non-phone numbers.

Edit: if you're matching against web documents, you also have the problem of phone numbers not being contiguous free text but containing HTML markup. It happens :)
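
The character-by-character approach described above could be sketched roughly as follows. This is only a minimal illustration under the question's stated rule (a run of non-letter characters containing at least 8 digits); the class name `PhoneRunScanner`, the method `findCandidates` and the constant `MIN_DIGITS` are made up for this example, and real-world rules such as valid area codes would have to be added on top.

```java
import java.util.ArrayList;
import java.util.List;

public class PhoneRunScanner {
    // Assumed threshold, taken from the question: at least 8 digits per run.
    private static final int MIN_DIGITS = 8;

    /** Returns every maximal run of non-letter characters that holds at least MIN_DIGITS digits. */
    public static List<String> findCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        int runStart = -1; // index where the current non-letter run began, or -1 if none
        int digits = 0;    // digits counted in the current run

        // Iterate one position past the end so that a run ending the text is also closed.
        for (int i = 0; i <= text.length(); i++) {
            boolean endOfRun = i == text.length() || Character.isLetter(text.charAt(i));
            if (!endOfRun) {
                if (runStart < 0) {
                    runStart = i;
                }
                // Note: Character.isDigit accepts any Unicode digit, unlike the regex \d default.
                if (Character.isDigit(text.charAt(i))) {
                    digits++;
                }
            } else {
                if (runStart >= 0 && digits >= MIN_DIGITS) {
                    candidates.add(text.substring(runStart, i).trim());
                }
                runStart = -1;
                digits = 0;
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(findCandidates("Tel: (011) 1234-1234 blah blah blah"));
        System.out.println(findCandidates("Fot 3 ..... a 3 blah blah blah"));
    }
}
```

With the two strings from the question, the first call yields a single candidate and the second yields none. Extra rules (minimum run length, allowed punctuation, known prefixes) would slot into the loop as additional conditions.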

Richard H
2

^(?=(?:.*[^\\p{L}\\d]){8,})(?=(?:.*\\d){8,}) if non-letter can't be a digit

^(?=(?:.*\\P{L}){8,})(?=(?:.*\\d){8,}) if non-letter can be a digit

Edit: commented versions, for use with the ignore-whitespace modifier /x (Pattern.COMMENTS in Java):

if non-letter can't be a digit

^                          # beginning of string
     (?=                         # Start look ahead assertion (consumes no characters)
          (?:                       # Start non-capture group
              .*                        # 0 or more anychar (will backtrack to match next char)
              [^\pL\d]                  # character: not a unicode letter nor a digit
          ){8,}                     # End group, do group 8 or more times
     )                           # End of look ahead assertion
     (?=                         # Start new look ahead (from beginning of string)
          (?:                        # Start grouping
              .*                         # 0 or more anychar (backtracks to match next char)
              \d                         # a digit
          ){8,}                      # End group, do 8 or more times (can be {8,}? to minimize match)
     )                           # End of look ahead

if non-letter can be a digit

^                       # Same form as above (except where noted)
    (?=                 #  ""
         (?:            #  ""
             .*         
             \PL        # character: not a unicode letter
         ){8,}
    )
    (?=
         (?:
             .*
             \d
         ){8,}
    )
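
For Java specifically, the second form (non-letter can be a digit) might be wired up as in the sketch below. It assumes `Pattern.COMMENTS` as the Java counterpart of `/x`; the class name `LookaheadDemo` and the method `looksLikePhone` are invented for illustration. Because the pattern consists only of lookaheads and consumes nothing, the sketch calls `find()` rather than `matches()`.

```java
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Pattern.COMMENTS makes the engine ignore whitespace and #-comments inside the pattern.
    // Each line ends with \n so that a comment stops at the end of its own line.
    private static final Pattern AT_LEAST_EIGHT = Pattern.compile(
          "^                          # beginning of input\n"
        + "(?= (?: .* \\P{L} ){8,} )  # lookahead: at least 8 non-letters\n"
        + "(?= (?: .* \\d    ){8,} )  # lookahead: at least 8 digits\n",
        Pattern.COMMENTS);

    /** True if the input has at least 8 non-letters and at least 8 digits overall. */
    public static boolean looksLikePhone(String text) {
        return AT_LEAST_EIGHT.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePhone("Tel: (011) 1234-1234 blah blah blah"));
        System.out.println(looksLikePhone("Fot 3 ..... a 3 blah blah blah"));
    }
}
```

Note that both lookaheads count characters anywhere in the input (up to the first line terminator, since `.` does not cross it by default), so they do not by themselves require the digits to sit in one contiguous non-letter run.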
  • Who puts braces around a simple `\pL` or `\PL`? That makes those a lot longer to type and messier to read. Since Java people seem never to “bother” with `Pattern.COMMENTS`, they need all the help they can get. – tchrist Mar 11 '11 at 21:22
  • @tchrist - annotations are included. How's that? –  Mar 11 '11 at 23:37
0

I would do it without using regular expressions. The non-regex code would be simple enough.

NPE
-1

How about something like this:

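import java.util.regex.Pattern;

class Test {
    // At least 8 repetitions of "optional non-digits, one or more digits, optional non-digits",
    // anchored to the whole string. Each repetition consumes at least one digit, so a match
    // needs 8 or more digits. Letters count as non-digits here.
    private static final Pattern EIGHT_DIGITS = Pattern.compile("^(\\D*(\\d+?)\\D*){8,}$");

    public static void main(String[] args) {
        for (String tel : new String[]{
            "Tel: (011) 1234-1234 blah blah blah",
            "Tel: (011) 123-1 blah blah blah"
        }) {
            System.err.println(tel + " " + (test(tel) ?
                "matches" : "doesn't match"));
        }
    }

    // Compiled once above instead of on every call.
    public static boolean test(String tel) {
        return EIGHT_DIGITS.matcher(tel).matches();
    }
}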

will produce:

Tel: (011) 1234-1234 blah blah blah matches
Tel: (011) 123-1 blah blah blah doesn't match
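
The pattern above counts digits across the whole string and treats letters like any other non-digit. If the digits must sit inside one run of non-letter characters, as the question asks, one possible variant is sketched below. It is an assumption about the intended rule, not part of the tested code above; `PhoneRunRegex`, `RUN` and `firstRun` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneRunRegex {
    // Optional leading non-letter, non-digit characters, then at least 8 digits,
    // each optionally followed by more non-letter, non-digit characters.
    // A letter therefore ends the run.
    private static final Pattern RUN =
        Pattern.compile("[^\\p{L}\\d]*(?:\\d[^\\p{L}\\d]*){8,}");

    /** Returns the first qualifying run with surrounding whitespace trimmed, or null if none. */
    public static String firstRun(String text) {
        Matcher m = RUN.matcher(text);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRun("Tel: (011) 1234-1234 blah blah blah"));
        System.out.println(firstRun("Fot 3 ..... a 3 blah blah blah"));
    }
}
```

Because the digit class and the separator class do not overlap, the engine has only one way to split a run, which keeps backtracking small.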
ziizii