I need to replace German phone numbers in Python. The formats are well explained here: Regexp for german phone number format

Possible formats are:

06442) 3933023     
(02852) 5996-0       
(042) 1818 87 9919   
06442 / 3893023  
06442 / 38 93 02 3     
06442/3839023
042/ 88 17 890 0     
+49 221 549144 – 79  
+49 221 - 542194 79  
+49 (221) - 542944 79
0 52 22 - 9 50 93 10 
+49(0)121-79536 - 77 
+49(0)2221-39938-113 
+49 (0) 1739 906-44  
+49 (173) 1799 806-44
0173173990644
0214154914479
02141 54 91 44 79
01517953677
+491517953677
015777953677
02162 - 54 91 44 79
(02162) 54 91 44 79

I am using the following code:

df['A'] = df['A'].replace(r'(\(?([\d \-\)\–\+\/\(]+)\)?([ .\-–\/]?)([\d]+))', 'TEL', regex=True)

The problem is that I have dates in the text:

df['A']
2017-03-07 13:48:39 Dear Sear Madam...

These need to be kept. How can I exclude the formats 2017-03-07 and 13:48:39 from my regex replacement?

Short Example:

df['A']
2017-03-077
2017-03-07
0211 11112244

desired output:

df['A']
TEL
2017-03-07
TEL
PV8
  • Try it using boundaries in the form of lookarounds: https://regex101.com/r/iuXqUg/1 Note that for a match you don't need all the capturing groups. – The fourth bird Oct 28 '19 at 12:39
  • You could add a negative lookahead `(?!\S)` and a negative lookbehind `(?<!\S)` to assert that there is no non-whitespace character before and after the match. Your pattern without the groups and unnecessary escapes in the character class is `(?<!\S)\(?[-\d )\–+/(]+\)?[- .–/]?\d+(?!\S)`, see https://regex101.com/r/46GCf6/1. But note that this pattern would also match, for example, `)4` because both parentheses are optional. See [this page](https://stackoverflow.com/questions/123559/a-comprehensive-regex-for-phone-number-validation) about validating phone numbers. – The fourth bird Oct 28 '19 at 12:52
  • I don't get it 100%. – PV8 Oct 28 '19 at 13:12
  • Regular expressions are for regular data and there is almost nothing regular about those numbers. How do I know that `0173173990644` is a phone number and not a serial number which should be ignored? – MonkeyZeus Oct 28 '19 at 13:16
  • In my case that could also be replaced; the goal is rather to protect one specific format. – PV8 Oct 28 '19 at 13:18
  • I missed that you also don't want to match `2017-03-07`. The current pattern matches it because the character class, which can also match a `-`, is repeated 1+ times. This would not match it: https://regex101.com/r/tng4Ox/1 but note, as pointed out by @MonkeyZeus, that the data you want to match is very broad. – The fourth bird Oct 28 '19 at 13:22
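The lookaround idea from these comments can be sketched with the standard `re` module. This is only an illustration: the whitespace boundaries and the character classes come from the pattern quoted in the comment above, while the extra negative lookahead that rejects a standalone `YYYY-MM-DD` token, and the names `PHONE_LIKE` and `mask_phone_like`, are assumptions of this sketch and not necessarily what the linked regex101 demos contain.

```python
import re

# Pattern quoted in the comment above, with one addition (an assumption of
# this sketch): a negative lookahead that refuses to start a match on a
# standalone date such as 2017-03-07. \u2013 is the en dash from the question.
PHONE_LIKE = re.compile(
    r"(?<!\S)"                       # no non-whitespace character before
    r"(?!\d{4}-\d{2}-\d{2}(?!\S))"   # assumed rule: skip YYYY-MM-DD tokens
    r"\(?[-\d )\u2013+/(]+\)?[- .\u2013/]?\d+"
    r"(?!\S)"                        # no non-whitespace character after
)


def mask_phone_like(text: str) -> str:
    """Replace every phone-like token in text with the literal 'TEL'."""
    return PHONE_LIKE.sub("TEL", text)


samples = [
    "2017-03-077",
    "2017-03-07",
    "0211 11112244",
    "2017-03-07 13:48:39 Dear Sear Madam...",
]
for s in samples:
    print(repr(s), "->", repr(mask_phone_like(s)))
```

The time `13:48:39` survives here only because `:` is not in the character class and the trailing `(?!\S)` rejects a match that stops right before it. As the comments point out, the pattern remains very broad, so other false positives are likely.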

1 Answer

Any way you slice it, you are not dealing with regular data, and regular expressions work best with regular data. You are always going to run into "false positives" in your situation.

Your best bet is to write out each pattern individually as one giant OR. Below is the pattern for the first three phone numbers; just do the rest of them the same way.

\d{5}\) \d{7}|\(\d{5}\) \d{4}-\d|\(\d{3}\) \d{4} \d{2} \d{4}

https://regex101.com/r/6NPzup/1
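One way to keep such a giant OR maintainable is to hold one sub-pattern per format in a list and join them. A minimal sketch, using only the three alternatives from the pattern above; the names `FORMATS`, `PHONE` and `replace_phones` are made up for illustration.

```python
import re

# One entry per phone format; the comment shows the sample it was written for.
FORMATS = [
    r"\d{5}\) \d{7}",                # 06442) 3933023
    r"\(\d{5}\) \d{4}-\d",           # (02852) 5996-0
    r"\(\d{3}\) \d{4} \d{2} \d{4}",  # (042) 1818 87 9919
]

# Join the alternatives into a single pattern, the "giant OR".
PHONE = re.compile("|".join(FORMATS))


def replace_phones(text: str) -> str:
    """Replace every substring matching one of FORMATS with 'TEL'."""
    return PHONE.sub("TEL", text)


print(replace_phones("(02852) 5996-0"))
print(replace_phones("2017-03-07 13:48:39 Dear Sear Madam..."))
```

Because none of these alternatives can match a string shaped like `2017-03-07` or `13:48:39`, dates and times are left alone. When more formats are added, alternatives are tried left to right at each position, so a longer, more specific format should probably be listed before a shorter one that matches its prefix.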

MonkeyZeus
  • So there is no way to protect one "special case"? Every rule I have counts, and there is no order for the rules? – PV8 Oct 28 '19 at 14:12
  • @PV8 Copy+paste+and pray with regex very rarely works out because regex is highly context specific. You will learn a lot about how to write regex if you follow through with my suggestion. If you are blindly trying to use someone else's regex then there are going to be many edge-cases which you have not thought about and monkey-patching it will further compound the issues. – MonkeyZeus Oct 28 '19 at 14:49
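On the question of rule order: within a single alternation, the alternatives are tried left to right at each position, so a "protect" rule can be listed first and handed back unchanged by a replacement function. This is a different technique from the one in the answer, sketched here under assumptions: the deliberately loose `phone` sub-pattern and the names `PROTECT_OR_PHONE`, `_swap` and `mask` are hypothetical.

```python
import re

# The "keep" alternative comes first, so a date or time wins whenever both
# alternatives could start at the same position.
PROTECT_OR_PHONE = re.compile(
    r"(?P<keep>\b\d{4}-\d{2}-\d{2}\b|\b\d{2}:\d{2}:\d{2}\b)"
    # Assumed, deliberately loose phone rule: digits mixed with separators.
    r"|(?P<phone>[+(]?\d[\d ()/+\u2013-]{5,}\d)"
)


def _swap(match: re.Match) -> str:
    # Protected tokens are returned as they are; everything else becomes TEL.
    return match.group(0) if match.group("keep") else "TEL"


def mask(text: str) -> str:
    return PROTECT_OR_PHONE.sub(_swap, text)


print(mask("Call 0211 11112244 on 2017-03-07 at 13:48:39"))
```

The ordering only helps when the protected token is where the match starts. A phone-like run that begins earlier and continues into a date (for example a number followed by a single space and then a date) would still swallow it, so the caveats about false positives above still apply.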