I want to write code that can parse American phone numbers (e.g. "(664)298-4397"). Below are the constraints:

  • allow leading and trailing white spaces
  • allow white spaces that appear between area code and local numbers
  • no white spaces in area code or the seven digit number XXX-XXXX

Ultimately I want to print a tuple of strings (area_code, first_three_digits_local, last_four_digits_local)

I have two sets of questions.

Question 1: Below are inputs my code should accept and print the tuple for:

  • '(664) 298-4397', '(664)298-4397', ' (664) 298-4397'

Below is the code I tried:

import re

regex_parse1 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664) 298-4397')
print (f' groups are: {regex_parse1.groups()} \n')

regex_parse2 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664)298-4397')
print (f' groups are: {regex_parse2.groups()} \n')

regex_parse3 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '   (664)      298-4397')
print (f' groups are: {regex_parse3.groups()}')     

The input strings in all three cases are valid and should return the tuple:

('664', '298', '4397')

But instead I'm getting the output below for all three:

groups are: ('', '', '4397')   

What am I doing wrong?

Question 2: The following two chunks of code should raise a "'NoneType' object has no attribute 'groups'" error, because the input phone number strings violate the constraints. Instead, I get output for both.

regex_parse4 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(404)555 -1212')
print (f' groups are: {regex_parse4.groups()}')

regex_parse5 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' ( 404)121-2121')
print (f' groups are: {regex_parse5.groups()}')

Expected output: an error, but instead I get output like this:

groups are: ('', '', '2121')

What is wrong with my regex code?

oguz ismail
PineNuts0

2 Answers

Try:

regex_parse4 = re.match(r'\s*[(]([0-9]{3})[)]\s*([0-9]{3})-([0-9]{4})\s*$', number)

This assumes a 3-digit area code in parentheses, followed by XXX-XXXX. The parentheses sit outside the capture groups, so only the digits are captured.

re.match returns None when there is no match, so calling .groups() on the result raises the AttributeError you expect.

If the above does not work, here is a helpful regex tool: https://regex101.com
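Because re.match returns None for input that does not match, it can be convenient to wrap the call in a small helper instead of relying on the AttributeError. The following is a minimal sketch, not a definitive implementation: the names PHONE_RE and parse_phone are hypothetical, and the pattern is a variant of the one above that assumes the parentheses are mandatory and kept outside the capture groups.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: mandatory parentheses outside the capture groups,
# optional white space only at the ends and after the area code.
PHONE_RE = re.compile(r'\s*[(]([0-9]{3})[)]\s*([0-9]{3})-([0-9]{4})\s*')


def parse_phone(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (area_code, first_three, last_four), or None if text is invalid."""
    match = PHONE_RE.fullmatch(text)
    if match is None:
        # No match: report it explicitly instead of failing on .groups().
        return None
    return match.groups()


print(parse_phone('(664) 298-4397'))
print(parse_phone('(404)555 -1212'))
```

Using fullmatch here means the whole string has to fit the pattern, so no explicit ^ or $ anchors are needed.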


Edit:

Another suggestion is to clean the data before applying the regex. This removes irregular spacing, the parentheses, and the '-'.

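# Strip every character that is not a digit.
clean_number = re.sub("[^0-9]", "", original_number)

# Split the remaining ten digits into 3-3-4.
regex_parse = re.match(r'([0-9]{3})([0-9]{3})([0-9]{4})', clean_number)

print(f'groups are: {regex_parse.groups()}')

>>> groups are: ('xxx', 'xxx', 'xxxx')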
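One thing to keep in mind with the cleaning approach is that it no longer distinguishes where the white space was, so input that the question wants rejected is accepted as well. A small sketch of that behavior follows; the function name parse_digits_only is hypothetical, and the use of fullmatch (to require exactly ten digits) is an assumption, not part of the snippet above.

```python
import re


def parse_digits_only(original_number):
    # Strip everything except digits, then split the ten digits into 3-3-4.
    clean_number = re.sub(r'[^0-9]', '', original_number)
    match = re.fullmatch(r'([0-9]{3})([0-9]{3})([0-9]{4})', clean_number)
    return match.groups() if match else None


print(parse_digits_only('(664) 298-4397'))   # valid input from the question
print(parse_digits_only('(404)555 -1212'))   # input the question wants rejected
```

Both calls return a tuple, so this variant fits best when lenient parsing is acceptable.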
nahar

In general, your regex overuses the asterisk *. Details as follows:

You have 3 capturing groups:

  1. ([\s]*[(]*[0-9]*[)]*[\s]*)
  2. ([\s]*[0-9]*)
  3. ([0-9]*[\s]*)

You use an asterisk on every single item, including the opening and closing parentheses. In fact, almost everything in your regex is quantified with an asterisk, so the capturing groups can also match empty strings. That's why your first and second capturing groups return empty strings. The only item without an asterisk is the hyphen - just before the third capturing group, which is also why your regex still captures the third group, as in 4397 and 2121.
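To see this concretely, each capturing group can be tested on its own against the empty string. This is only an illustrative sketch; the variable names group_patterns and full are hypothetical, and the patterns are copied from the question.

```python
import re

# The three capture groups from the question, taken on their own.
group_patterns = [
    r'[\s]*[(]*[0-9]*[)]*[\s]*',
    r'[\s]*[0-9]*',
    r'[0-9]*[\s]*',
]

for pattern in group_patterns:
    # fullmatch against '' succeeds only if the pattern can match nothing at all.
    print(pattern, '->', re.fullmatch(pattern, '') is not None)

# The hyphen is the only mandatory character, so a lone '-' matches the whole regex.
full = r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$'
print(re.match(full, '-') is not None)
```

Every line printed is True: all three groups accept the empty string, and the full pattern accepts a bare hyphen.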

To solve your problem, use the asterisk only where it is actually needed.

In fact, your regex still has plenty of room for improvement. For example, it currently matches runs of digits of any length (instead of chunks of 3 or 4 digits). It also allows an area code that is not enclosed in parentheses (because of the asterisks on the parenthesis symbols).
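As a rough sketch of what "asterisk only where needed" could look like, the pattern below keeps \s* for the optional white space and uses fixed counts {3} and {4} plus mandatory parentheses everywhere else. The names ORIGINAL, STRICT and samples are hypothetical, and the last two sample strings are made-up inputs added to illustrate the two points above.

```python
import re

# The regex from the question, unchanged.
ORIGINAL = re.compile(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$')

# Assumed stricter pattern: * is used only for optional white space.
STRICT = re.compile(r'''
    \s*                  # optional leading white space
    [(] ([0-9]{3}) [)]   # area code: exactly three digits, parentheses required
    \s*                  # optional white space after the area code
    ([0-9]{3})           # first three digits of the local number
    -                    # mandatory hyphen with no white space around it
    ([0-9]{4})           # last four digits
    \s*                  # optional trailing white space
''', re.VERBOSE)

samples = [
    '(664) 298-4397',
    '(664)298-4397',
    '   (664)      298-4397',
    '(404)555 -1212',
    ' ( 404)121-2121',
    '12345-6',           # made-up: digit runs of the wrong length
    '664 298-4397',      # made-up: area code without parentheses
]

for text in samples:
    strict = STRICT.fullmatch(text)
    print(repr(text),
          '| original matches:', ORIGINAL.match(text) is not None,
          '| strict:', strict.groups() if strict else None)
```

With re.VERBOSE the white space and comments inside the pattern are ignored, which makes it easier to check each piece against the constraints.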

For this kind of common regex, I suggest you don't reinvent the wheel. Ready-made regexes are easy to find on the Internet. For example, you can refer to this post. Although the post uses JavaScript instead of Python, the regex is much the same.

SeaBean