Possible Duplicate:
Regular expression to match US phone numbers

I need to find phone numbers in HTML. I have seen many examples here and on Google, but I cannot get any of them to work: the regex simply won't find the number. Suppose the HTML is:

  1. example site 1 for phone number
  2. example site 2 for phone number

Basically I am after all US-pattern phone numbers. I have tried every pattern I could find, but with no luck. I am using this code:

public static string Extractphone(string html)
{
    StringBuilder sb = new StringBuilder();

    try
    {
        List<string> tmpemail = new List<string>();
        string data = html; 
        //instantiate with this pattern 
        Regex emailRegex = new Regex(@"(\\d{3})-(\\d{3})-(\\d{4})",
            RegexOptions.IgnoreCase);
        //find items that matches with our pattern
        MatchCollection emailMatches = emailRegex.Matches(data);

        foreach (Match emailMatch in emailMatches)
        {
            if (!tmpemail.Contains(emailMatch.Value.ToLower()))
            {
                sb.AppendLine(emailMatch.Value.ToLower());

                tmpemail.Add(emailMatch.Value.ToLower());
            }
          //  (541) 708-1364
        }
        //store to file
    }
    catch (Exception ex)
    {
    }
    return sb.ToString();
}

I have changed the pattern many times, based on many examples, but with no luck.

confusedMind

3 Answers

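The '@' sign makes your string a verbatim literal, so escape sequences are not processed, yet you are still using \\ to escape the backslash. The regex engine therefore receives \\d, which means a literal backslash followed by the letter d, not a digit.

Remove either the extraneous backslashes or the @ sign, because your regex otherwise looks right for a basic US phone number.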
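To make the difference concrete, here is a minimal sketch that runs the three string forms against one sample input. The class name `VerbatimDemo`, the method `Matches`, and the sample text are made up for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Hypothetical helper: true when the pattern finds anything in the input.
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        string input = "Call 541-708-1364 today";

        // Verbatim string, doubled backslashes: the engine sees \\d,
        // i.e. a literal backslash followed by the letter d.
        bool broken = Matches(input, @"(\\d{3})-(\\d{3})-(\\d{4})");

        // Verbatim string, single backslashes: the engine sees \d (a digit).
        bool verbatim = Matches(input, @"(\d{3})-(\d{3})-(\d{4})");

        // Regular string, doubled backslashes: the compiler turns \\ into \,
        // so the engine again sees \d.
        bool regular = Matches(input, "(\\d{3})-(\\d{3})-(\\d{4})");

        Console.WriteLine($"{broken} {verbatim} {regular}"); // False True True
    }
}
```

Either of the last two forms works; mixing them, as in the question, does not.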

See here: A comprehensive regex for phone number validation for the standard SO answer, and here: http://regexlib.com/Search.aspx?k=US%20Phone%20number for a good regex site, if you haven't seen them yet.

mcalex
  • You are using a verbatim string literal (the @ prefix), so your '\\' is not an escaped backslash; the regex receives both backslashes. Just removing the extra backslash will get you a match on your first case.
  • To handle multiple formats, you have to put each of them into the regex. Since the number might have a leading parenthesis, check for it with \(?. Likewise, after the area code you may have either a closing parenthesis followed by zero or more spaces, or a dash, so instead of just - you need the alternation (\)\s*|-).
  • You don't need parentheses around the \d{3} or \d{4} parts. They only create capturing groups, which you never use because you read the whole match. They probably just make the expression harder to read and understand.

So that leaves you with the following for your Regex initialization

Regex emailRegex = new Regex(@"\(?\d{3}(\)\s*|-)\d{3}-\d{4}",
            RegexOptions.IgnoreCase);

I haven't tested this robustly but I think that works.
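A quick way to check it is to run the pattern over a few sample strings. The sketch below assumes three made-up inputs; `PatternCheck` and `FirstMatch` are hypothetical names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Pattern from above: optional "(", area code, then either ")" plus
    // optional whitespace or a dash, then the remaining seven digits.
    private static readonly Regex Phone =
        new Regex(@"\(?\d{3}(\)\s*|-)\d{3}-\d{4}");

    // Returns the first phone-like substring, or an empty string if none is found.
    public static string FirstMatch(string text)
    {
        Match m = Phone.Match(text);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string[] samples =
        {
            "Call (541) 708-1364 now",
            "Call 541-708-1364 now",
            "Call 541 708 1364 now" // space-separated: not covered by this pattern
        };

        foreach (string s in samples)
            Console.WriteLine($"'{s}' -> '{FirstMatch(s)}'");
    }
}
```

The first two samples match and the third does not. Because the opening parenthesis is optional on its own, the pattern also accepts unbalanced input such as "(541-708-1364".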

As a side note, regular expressions are really cryptic if you don't understand them. Taking someone else's expression and using it as-is can give poor results if you don't know what the expression actually checks for. Also, what I wrote here is not comprehensive: it only handles those two formats. An expression that handles any phone number quickly gets much more complicated.

Craig Suchanec

Try this regex

(?:\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}

Explanation:

(?:subexpression) Defines a noncapturing group.

\d Matches any decimal digit.

| Matches any one element separated by the vertical bar | character.

and some sample code:

var results = Regex.Matches(strInput, @"(?:\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}");

but note that:

Verbatim string literals start with @ and are also enclosed in double quotation marks. For example:

@"c:\Docs\Source\a.txt" // rather than "c:\\Docs\\Source\\a.txt"

and

@"(\d{3})-(\d{3})-(\d{4})"
rather than
"(\\d{3})-(\\d{3})-(\\d{4})"
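Putting it together, the extraction method from the question might look roughly like the sketch below. This is one possible shape, not a definitive implementation: `PhoneExtractor`, `ExtractPhones`, and the sample HTML are illustrative names and data, and a HashSet replaces the List.Contains check for de-duplication.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhoneExtractor
{
    // Verbatim string with single backslashes, as explained above.
    private static readonly Regex Phone =
        new Regex(@"(?:\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}");

    // Returns each distinct matched string once, in order of first appearance.
    public static List<string> ExtractPhones(string html)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();

        foreach (Match m in Phone.Matches(html))
        {
            // HashSet.Add returns false when the value is already present.
            if (seen.Add(m.Value))
                result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        string html = "<p>Call (541) 708-1364 or 541-708-1364.</p>" +
                      "<p>Again: (541) 708-1364</p>";

        foreach (string phone in ExtractPhones(html))
            Console.WriteLine(phone);
    }
}
```

Note that de-duplication here is purely textual: the same number written in two different formats is reported twice unless the digits are normalized first.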

Ria