0

I'm experiencing problems with a regex that parses custom phone numbers:

  1. A value matching "wtvCode" group is optional;
  2. A value matching "countryCode" group is optional;
  3. The countryCode rule overlaps with areaCityCode rule for some values. In such cases, when countryCode is missing, its expression captures the areaCityCode value instead.

Code example is below.

Regex regex = new Regex(string.Concat(
    "^(",
    "(?<wtvCode>[A-Z]{3}|)",
    "([-|/|#| |]|)",
    "(?<countryCode>[2-9+]{2,5}|)",
    "([-|/|#| |]|)",
    "(?<areaCityCode>[0-9]{2,3}|)",
    "([-|/|#| |]|))",
    "(?<phoneNumber>(([0-9]{8,18})|([0-9]{3,4}([-|/|#| |]|)[0-9]{4})|([0-9]{4}([-|/|#| |]|)[0-9]{4})|([0-9]{4}([-|/|#| |]|)[0-9]{4}([-|/|#| |]|)[0-9]{1,5})))",
    "([-|/|#| |]|)",
    "(?<foo>((A)|(B)))",
    "([-|/|#| |]|)",
    "(?<bar>(([1-9]{1,2})|)",
    ")$"
));

string[] validNumbers = new[] {
    "11-1234-5678-27-A-2",   // missing wtvCode and countryCode
    "48-1234-5678-27-A-2",   // missing wtvCode and countryCode
    "55-48-1234-5678-27-A-2" // missing wtvCode 
};

foreach (string number in validNumbers) {
    Console.WriteLine("countryCode: {0}", regex.Match(number).Groups["countryCode"].Value);
    Console.WriteLine("areaCityCode: {0}", regex.Match(number).Groups["areaCityCode"].Value);
    Console.WriteLine("phoneNumber: {0}", regex.Match(number).Groups["phoneNumber"].Value);
}

The output for that is:

// First number
// countryCode:               <- correct
// areaCityCode: 11           <- correct, but that's because "11" is never a countryCode
// phoneNumber: 1234-5678-27  <- correct

// Second number
// countryCode: 48            <- wrong, should be ""
// areaCityCode:              <- wrong, should be "48"
// phoneNumber: 1234-5678-27  <- correct

// Third number
// countryCode: 55            <- correct
// areaCityCode: 48           <- correct
// phoneNumber: 1234-5678-27  <- correct

I've failed so far on fixing this regular expression in a way that it covers all my constraints and doesn't mess with countryCode and areaCityCode when a value match both rules. Any ideas?

Thanks in advance.


Update

The correct regex pattern for phone country codes can be found here: https://stackoverflow.com/a/6967885/136381

Community
  • 1
  • 1
Hilton Perantunes
  • 576
  • 2
  • 15
  • 24

2 Answers2

2

First I recommend using the ? quantifier to make things optional instead of the empty alternatives you're using now. And in the case of the country code, add another ? to make it non-greedy. That way it will try initially to capture the first bunch of digits in the areaCityCode group. Only if the overall match fails will it go back and use the countryCode group instead.

Regex regex = new Regex(
    @"^
    ( (?<wtvCode>[A-Z]{3}) [-/# ] )?
    ( (?<countryCode>[2-9+]{2,5}) [-/# ] )??
    ( (?<areaCityCode>[0-9]{2,3}) [-/# ] )?
    (?<phoneNumber> [0-9]{8,18} | [0-9]{3,4}[-/# ][0-9]{4}([-/# ][0-9]{1,5})? )
    ( [-/# ] (?<foo>A|B) )
    ( [-/# ] (?<bar>[1-9]{1,2}) )?
    $",
  RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

As you can see, I've made a few other changes to your code, the most important being the switch from ([-|/|#| |]|) to [-/# ]. The pipes inside the brackets would just match a literal |, which I'm pretty sure you don't want. And the last pipe made the separator optional; I hope they don't really have to be optional, because that would make this job a lot more difficult.

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
1

There are two things overlooked by yourself and the other responders.

The first is that it makes more sense to work in reverse, in other words right to left because there are more required fields to the end of the text than at the begininning. By removing the doubt of the WTV and the Country code it becomes much easier for the regex parser to work (though intellectually harder for the person writting the pattern).

The second is the use of the if conditional in regex (? () | () ). That allows us to test out a scenario and implement one match pattern over another. I describe the if conditional on my blog entitled Regular Expressions and the If Conditional. The pattern below tests out whether there is the WTV & Country, if so it matches that, if not it checks for an optional country.

Also instead of concatenating the pattern why not use IgnorePatternWhitespace to allow the commenting of a pattern as I show below:

string pattern = @"
^
(?([A-Z][^\d]?\d{2,5}(?:[^\d]))  # If WTV & Country Code (CC)
  (?<wtvCode>[A-Z]{3})           # Get WTV & CC
  (?:[^\d]?)
  (?<countryCode>\d{2,5})
  (?:[^\d])                    # Required Break
  |                            # else maybe a CC
  (?<countryCode>\d{2,5})?     # Optional CC
  (?:[^\d]?)                   # Optional Break
 )
(?<areaCityCode>\d\d\d?)        # Required area city
(?:[^\d]?)                      # Optional break (OB)
(?<PhoneStart>\d{4})            # Default Phone # begins
(?:[^\d]?)                      # OB
(?<PhoneMiddle>\d{4})           # Middle
(?:[^\d]?)                      # OB
(?<PhoneEnd>\d\d)               # End
(?:[^\d]?)                      # OB
(?<foo>[AB])                    # Foo?
(?:[^AB]+)
(?<bar>\d)
$
";

    var validNumbers = new List<string>() {
    "11-1234-5678-27-A-2",   // missing wtvCode and countryCode
    "48-1234-5678-27-A-2",   // missing wtvCode and countryCode
    "55-48-1234-5678-27-A-2", // missing wtvCode
    "ABC-501-48-1234-5678-27-A-2" // Calling Belize (501)
};

    validNumbers.ForEach( nm =>
                {
                    // IgnorePatternWhitespace only allows us to comment the pattern; does not affect processing
                    var result = Regex.Match(nm, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.RightToLeft).Groups;

                    Console.WriteLine (Environment.NewLine + nm);
                    Console.WriteLine("\tWTV code    : {0}", result["wtvCode"].Value);
                    Console.WriteLine("\tcountryCode : {0}", result["countryCode"].Value);
                    Console.WriteLine("\tareaCityCode: {0}", result["areaCityCode"].Value);
                    Console.WriteLine("\tphoneNumber : {0}{1}{2}", result["PhoneStart"].Value, result["PhoneMiddle"].Value, result["PhoneEnd"].Value);

                }
    );

Results:

11-1234-5678-27-A-2
  WTV code    : 
  countryCode : 
  areaCityCode: 11
  phoneNumber : 1234567827

48-1234-5678-27-A-2
  WTV code    : 
  countryCode : 
  areaCityCode: 48
  phoneNumber : 1234567827

55-48-1234-5678-27-A-2
  WTV code    : 
  countryCode : 55
  areaCityCode: 48
  phoneNumber : 1234567827

ABC-501-48-1234-5678-27-A-2
  WTV code    : ABC
  countryCode : 501
  areaCityCode: 48
  phoneNumber : 1234567827

Notes:

  • If there is no divider between the country code and the city code, there is no way a parser can determine what is city and what is country.
  • Your original country code pattern failed [2-9] failed for any country with a 0 in it. Hence I changed it to [2-90].
ΩmegaMan
  • 29,542
  • 12
  • 100
  • 122
  • Well pointed, thanks. Also, [2-90] doesn't work for phone country codes either. I ended up using the pattern described here: http://stackoverflow.com/a/6967885/136381 – Hilton Perantunes Mar 30 '12 at 18:17