Decoding a regex... I know what its function is, but I want to understand exactly what is happening

Question

I have a regular expression that I'm going to be using to verify that an entered number is in standard U.S. telephone format, i.e. (###) ###-####. I am new to regex and still having some trouble figuring out the exact function of each character. If someone would go through this piece by piece and verify that I am understanding it, I would really appreciate it. Also, if the regex is wrong, I would obviously like to know that.

\D*?(\d\D*?){10}

What I think is happening:
\D*?( indicates an escape sequence for the parenthesis metacharacter... not sure why the \D*? is necessary
\d indicating digits
\D*? indicating there is a non-digit character (-) followed by the closing parenthesis.
{10} for the 10 digits

I feel very unsure explaining this, like my understanding is very vague in terms of why the regex is in the order that it is etc. Thanks in advance for help/explanations.

EDIT

It seems like this is not the best regex for what I want. Another possibility was [(][0-9]{3}[)] [0-9]{3}-[0-9]{4}, but I was told this would fail. I suppose I'll have to do a little more work with regular expressions to figure this out.

Possible Duplicate: http://stackoverflow.com/questions/123559/a-comprehensive-regex-for-phone-number-validation — David Starkey, Jul 09 '13 at 17:25
It's an *extremely* broad regex for your requirements - it matches all kinds of input - e.g. `a5bcdefgh55a55c5d5g5!5q5` will match. In simple terms, all it requires is 10 digits in a string - that is, the input can have any number of (including 0) random characters between the digits, as well as at the beginning and end of the string. Adding this as a comment, because the answer here should really be a better solution for the problem. — JimmiTh, Jul 09 '13 at 17:32
@500-InternalServerError: wow, debuggex is amazing. Thanks for linking to it. OP: You might try an expression like this one: http://www.debuggex.com/r/MSYZS_GBTT3UdNM9/0 — Wug, Jul 09 '13 at 17:41
Agreed that debuggex is very useful. I've come up with \(\d{3}\)\d{3}[-]\d{4}$ which seems to work alright. — Jsh, Jul 09 '13 at 17:57

Jerry · Answer 1 · 2013-07-09T18:26:50.190

\D matches any non-digit character.

* means that the preceding item (here \D) is repeated 0 or more times.

*? means the same thing, but lazily: the preceding item is repeated as few times as possible, taking another character only when the rest of the regex cannot match otherwise. It is a bit difficult perhaps at the start, but in your regex, the next item is \d, meaning \D*? will match the fewest non-digit characters needed to reach the next digit.
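To make the greedy/lazy difference concrete, here is a small sketch using Python's re module (Python is my choice for illustration, since the question does not name a language; the sample string "ab12" is made up):

```python
import re

text = "ab12"  # made-up sample: two non-digits, then two digits

# Greedy: \D* takes as many non-digits as it can.
greedy = re.match(r"\D*", text).group()

# Lazy with nothing after it: \D*? is satisfied by zero characters.
lazy_alone = re.match(r"\D*?", text).group()

# Lazy followed by \d: it grows one character at a time until \d can match.
lazy_then_digit = re.match(r"\D*?\d", text).group()

print(repr(greedy))           # 'ab'
print(repr(lazy_alone))       # ''
print(repr(lazy_then_digit))  # 'ab1'
```

Because \D can never consume a digit, greedy and lazy versions of \D* end up matching the same text when a \d follows; the laziness in the original regex mostly changes how the engine searches, not what it accepts.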

( ... ) is a capture group, and is also used to group things. For instance {10} means that the previous character or group is repeated 10 times exactly.

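Now, \D*?(\d\D*?){10} will match any text that contains 10 digits, whether or not it starts with non-digit characters, and with any non-digit characters in between the digits. Since nothing anchors the regex, a longer string with more than 10 digits also contains a match.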
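A quick way to see how permissive this is, sketched in Python (again an assumed language; the sample strings are illustrative, and the second one is taken from the comments on the question):

```python
import re

loose = re.compile(r"\D*?(\d\D*?){10}")

samples = [
    "(555) 123-4567",            # the intended format
    "a5bcdefgh55a55c5d5g5!5q5",  # junk, but it contains ten digits
    "12345678901",               # eleven digits: the first ten still match
    "555-1234",                  # only seven digits
]

# search() looks for a match anywhere in the string.
results = {s: loose.search(s) is not None for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Only the last sample is rejected, because it is the only one with fewer than ten digits.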

[(][0-9]{3}[)] [0-9]{3}-[0-9]{4}

This regex is a bit better since it doesn't just accept anything (like the first regex does) and will match the format (###) ###-#### (notice the space is a character in regex!).

The new things introduced here are the square brackets. These represent character classes. [0-9] means any character from 0 to 9 inclusive, which means it will match 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. Adding {3} after it makes it match that character class 3 times in a row, and since the class contains only digits, it will match exactly 3 digits.

A character class can be used to escape certain characters, such as ( or ) (note I mentioned earlier they are for capturing groups, or grouping) and thus, [(] and [)] are literal ( and ) instead of being used for capturing/grouping.

You can also use backslashes (\) to escape characters. Thus:

\([0-9]{3}\) [0-9]{3}-[0-9]{4}

Will also work. I would also recommend the use of line anchors ^ and $ if you're only trying to see if a phone number matches the above format. This ensures that the string has only the phone number, and nothing else. ^ matches the beginning of a line and $ matches the end of a line. Thus, the regex will become:

^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$
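As a minimal sketch of how the anchored regex could be used for validation, here it is wrapped in a Python function (the language and the names US_PHONE and is_us_phone are my own choices for illustration, not something from the question):

```python
import re

# The anchored pattern from above, compiled once for reuse.
US_PHONE = re.compile(r"^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$")


def is_us_phone(text: str) -> bool:
    """Return True if text is exactly of the form (###) ###-####."""
    return US_PHONE.match(text) is not None


print(is_us_phone("(555) 123-4567"))           # True
print(is_us_phone("555-123-4567"))             # False: no parentheses
print(is_us_phone("(555)123-4567"))            # False: missing space
print(is_us_phone("call (555) 123-4567 now"))  # False: extra text around it
```

One Python-specific detail: $ also matches just before a single trailing newline, so "(555) 123-4567\n" would pass. If that matters, re.fullmatch with the unanchored pattern is stricter.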

However, I don't know all the combinations of the different formats of phone numbers in the US, so this regex might need some tweaking if you have different phone number formats.

+1 for explanation *and* better solution. Worth noting that in this case it is indeed a better idea to use `[0-9]` rather than `\d`, since in a Unicode-aware regex engine the latter would also allow this input: "(٣௨౮) ໙੩૪-୭༤૯௦" - should anyone decide to test your validation thoroughly... ;-) In other words, `\d` there matches *any* Unicode decimal digit, whatever script it comes from, while the character class only allows the ASCII digits 0 through 9. — JimmiTh, Jul 09 '13 at 19:24
@JimmiTh That's why I didn't change the `[0-9]` back to `\d` :) But on the other hand, I didn't want to explicitly mention this yet, since there's already quite a lot to assimilate at once! — Jerry, Jul 09 '13 at 19:25
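Whether `\d` really accepts such input depends on the regex engine and its settings. As one illustration, here is how Python 3 behaves with string patterns (Python is an assumption on my part, and the Arabic-Indic test input is constructed just for this sketch):

```python
import re

# Build "(555) 123-4567" with Arabic-Indic digits (U+0660 to U+0669).
to_arabic_indic = {ord(str(i)): chr(0x0660 + i) for i in range(10)}
fancy = "(555) 123-4567".translate(to_arabic_indic)

with_backslash_d = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")
with_class = re.compile(r"^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$")
ascii_backslash_d = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$", re.ASCII)

d_accepts = with_backslash_d.match(fancy) is not None
class_accepts = with_class.match(fancy) is not None
ascii_d_accepts = ascii_backslash_d.match(fancy) is not None

print(d_accepts)        # True: \d matches any Unicode decimal digit here
print(class_accepts)    # False: [0-9] only allows ASCII digits
print(ascii_d_accepts)  # False: re.ASCII restricts \d to [0-9]
```

So in Python, either the explicit class or the re.ASCII flag keeps the validation to ASCII digits; other engines have their own defaults and switches.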

score 2 · Answer 2 · answered Jul 09 '13 at 17:29

\D is "not a digit"; \d is "digit". With that in mind:

This matches zero or more non-digits, then it matches a digit followed by any number of non-digit characters, 10 times. This won't actually verify that the number is formatted properly, just that it contains 10 digits. I suspect that the regex isn't what you want in the first place.

For example, the following will match your regex:

this is some bad text 1 and some more 2 and more 34567890

score 0 · Answer 3 · answered Jul 09 '13 at 17:34

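\D matches a character that is not a digit.
* repeats the previous item 0 or more times.
? after * makes that repetition lazy, so it matches as few characters as possible.
\d matches a digit.

So your group matches a digit followed by any non-digits, and {10} repeats that group 10 times.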

Casey Sebben
