0

I would like to clean up the phone number column in my pandas dataframe. I'm using below code but it leaves a bracket at the end. How do I get the right regex to exclude any extra characters in the end like (, or anything which is not part of phone number. I've looked through old posts, but can't seem to find exact solution. sample code below :

import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.extract('([\(\)\s\d\-]+)',expand= True) 

expected output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)      1234564567

Current output :
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)      1234564567(

PriyankaJ
  • 339
  • 4
  • 19

3 Answers3

5

You may use

((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})

See the regex demo

Details

  • (?:\(\d{3}\)|\d{3})? - an optional sequence of
    • \(\d{3}\) - (, three digits, )
    • | - or
    • \d{3} - three digits
  • (?:\s|\s?-\s?)? - an optional sequence of a whitespace char or an - enclosed with single optional whitespaces
  • \d{3} - three digits
  • (?:\s|\s?-\s?)? - an optional sequence of a whitespace char or an - enclosed with single optional whitespaces
  • \d{4} - four digits.

Pandas test:

>>> df1['x'].str.extract(r'((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})',expand= True)
                 0
0       1234567890
1     202-456-3456
2   (202)-456-3456
3  (202)-456- 4567
4       1234564567
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Nice solution. Has the advantage of being highly tunable whilst not needing to worry about edge cases in the bits to ignore. I would go with this instead of my solution. +1 – Chris Apr 08 '20 at 08:28
  • Thanks a lot for the solution, but for the time being I will go with above solution for the simple reason that it looks simpler. – PriyankaJ Apr 08 '20 at 10:17
  • @PriyankaJ Without any exact requirements, it is up to you to decide. I tried to make *your original* approach work in a safer way, the other solution relies on many assumptions that may be true or not since we do not know what your other data look like. Just suit yourself. – Wiktor Stribiżew Apr 08 '20 at 10:23
  • @WiktorStribiżew, totally agree with you. Quick question - what does ?: in the expression mean? – PriyankaJ Apr 08 '20 at 12:02
  • @PriyankaJ It is a [non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions). In Pandas, with the `str.extract` and other regex methods, it is useful since it helps avoid creating redundant columns. – Wiktor Stribiżew Apr 08 '20 at 12:04
2

How about a different approach? Instead of trying to match the phone numbers, remove the bits you don't want:

import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.replace(r'\([^0-9]+\)|\D*$', '')

Output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)       1234564567

It means using str.replace instead of str.extract but I think the code is simpler as a result.

Explanation: \([^0-9]+\) matches any characters except 0-9 inside parentheses.

| means logical OR.

\D*$ matches zero or more non-numeric characters at the end of the string.

Used with replace, this matches the above pattern and replaces it with an empty string.

Chris
  • 1,618
  • 13
  • 21
0

I would use replace.

df1['x1'] = df1['x'].str.replace(r'(?<=\(\d{3}\)[-]\d{3}[-]\d{4})[a-z]*', '')
df1

Simply put replace Y if it is immediately to the right of X that is (?<+X)Y

Y= group of lower case alphanumerics - [a-z]*

X= three digits between () followed by a dash \(\d{3}\)[-] followed by; another three digits and a dash \(\d{3}\)[-] and finally followed by; four digits and a dash `(\d{4})

Output

enter image description here

wwnde
  • 26,119
  • 6
  • 18
  • 32