Python regex to extract phone numnber

Question

I would like to clean up the phone number column in my pandas dataframe. I'm using below code but it leaves a bracket at the end. How do I get the right regex to exclude any extra characters in the end like (, or anything which is not part of phone number. I've looked through old posts, but can't seem to find exact solution. sample code below :

import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.extract('([\(\)\s\d\-]+)',expand= True) 

expected output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)      1234564567

Current output :
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)      1234564567(

score 5 · Answer 1 · answered Apr 08 '20 at 08:11

5

You may use

((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})

See the regex demo

Details

(?:$\d{3}$|\d{3})? - an optional sequence of
- $\d{3}$ - (, three digits, )
- | - or
- \d{3} - three digits
(?:\s|\s?-\s?)? - an optional sequence of a whitespace char or an - enclosed with single optional whitespaces
\d{3} - three digits
(?:\s|\s?-\s?)? - an optional sequence of a whitespace char or an - enclosed with single optional whitespaces
\d{4} - four digits.

Pandas test:

>>> df1['x'].str.extract(r'((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})',expand= True)
                 0
0       1234567890
1     202-456-3456
2   (202)-456-3456
3  (202)-456- 4567
4       1234564567

answered Apr 08 '20 at 08:11

Wiktor Stribiżew

607,720
39
448
563

1

Nice solution. Has the advantage of being highly tunable whilst not needing to worry about edge cases in the bits to ignore. I would go with this instead of my solution. +1 – Chris Apr 08 '20 at 08:28
Thanks a lot for the solution, but for the time being I will go with above solution for the simple reason that it looks simpler. – PriyankaJ Apr 08 '20 at 10:17
@PriyankaJ Without any exact requirements, it is up to you to decide. I tried to make *your original* approach work in a safer way, the other solution relies on many assumptions that may be true or not since we do not know what your other data look like. Just suit yourself. – Wiktor Stribiżew Apr 08 '20 at 10:23
@WiktorStribiżew, totally agree with you. Quick question - what does ?: in the expression mean? – PriyankaJ Apr 08 '20 at 12:02
@PriyankaJ It is a [non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions). In Pandas, with the `str.extract` and other regex methods, it is useful since it helps avoid creating redundant columns. – Wiktor Stribiżew Apr 08 '20 at 12:04

Chris · Accepted Answer · 2020-04-08T10:04:08.763

2

How about a different approach? Instead of trying to match the phone numbers, remove the bits you don't want:

import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.replace(r'\([^0-9]+\)|\D*$', '')

Output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)       1234564567

It means using str.replace instead of str.extract but I think the code is simpler as a result.

Explanation: $[^0-9]+$ matches any characters except 0-9 inside parentheses.

| means logical OR.

\D*$ matches zero or more non-numeric characters at the end of the string.

Used with replace, this matches the above pattern and replaces it with an empty string.

edited Apr 08 '20 at 10:04

answered Apr 08 '20 at 08:22

Chris

1,618
13
21

1

Well, but `(202)-456-3456adsd` should be `(202)-456-3456`, shouldn't it? You may probably use `.replace(r'$[^()0-9]+$|\D*$', '')` – Wiktor Stribiżew Apr 08 '20 at 08:27
I suspect this solution would end up hairier than @WiktorStribiżew's byt the time all of the edge cases have been addressed, but it solves the problem given OP's data. – Chris Apr 08 '20 at 08:30
1

@Chris, can you please explain your answer? – PriyankaJ Apr 08 '20 at 10:00
@PriyankaJ sure, good point, I should have included that in the first place. Seems to be one of those days for me! – Chris Apr 08 '20 at 10:04
1

thankyou! I think this should suffice for my requirement – PriyankaJ Apr 08 '20 at 10:11

wwnde · Answer 3 · 2020-04-08T22:16:30.540

0

I would use replace.

df1['x1'] = df1['x'].str.replace(r'(?<=\(\d{3}\)[-]\d{3}[-]\d{4})[a-z]*', '')
df1

Simply put replace Y if it is immediately to the right of X that is (?<+X)Y

Y= group of lower case alphanumerics - [a-z]*

X= three digits between () followed by a dash $\d{3}$[-] followed by; another three digits and a dash $\d{3}$[-] and finally followed by; four digits and a dash `(\d{4})

Output

edited Apr 08 '20 at 22:16

answered Apr 08 '20 at 21:15

wwnde

26,119
6
18
32

Python regex to extract phone numnber

3 Answers3