I am very new to regular expressions and am looking for help parsing phone numbers out of HTML text.

On the source site, the HTML tags are badly malformed and do not have any unique selectors that I can use. Below is the list of formats I am looking to parse.

raw = """+49 39291 55-217
02102 7007064
0152 01680970
+49 39291 55-216
02102 3802 22
0800 333004 451-100
+49 221 9937 26950
02151-47974510
+49(0)6105 937 -539
0211/409 2268
+49(0)6105 937 -539
+49211/584-623
0211 58422 2012
+49 (9131) 7-35335
+49 521 9488 2470
+ 49-40-70 70 84 - 0
0211 17 95 99 04
02151-47974327
+49 203 28900 1121
0211 9449-2555
+49 (5 41) 9 98 -2268"""

I tried this pattern, but could not get any further with it:

import re, requests

phones = re.findall(re.compile(r'.*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?'),raw)

phones
['102 7007064', '152 0168097', '151-4797451', '937 -539\n0211', '937 -539\n+4921', '584-623\n0211', '151-4797432']

Any advice or help is highly appreciated. Thank you.

Paolo
Shekhar Samanta
  • Are all of the above valid? Which ones are not valid? – Onyambu Aug 30 '18 at 09:29
  • `\D` matches line break chars, too. You should replace it with something like `[-./]?` – Wiktor Stribiżew Aug 30 '18 at 09:30
  • @WiktorStribiżew, I tried that, but it doesn't output them all well. I observed two points: the phone numbers start with either +49 or a 0 – Shekhar Samanta Aug 30 '18 at 09:32
  • The first related question on the right side: https://stackoverflow.com/q/123559/8881141 – Mr. T Aug 30 '18 at 09:33
  • Guys, please do not suggest phone validation threads, it is about extraction of phone numbers from a longer text. – Wiktor Stribiżew Aug 30 '18 at 09:34
  • Yes, true, it's about extracting phone numbers from HTML source. – Shekhar Samanta Aug 30 '18 at 09:35
  • Try [`\+? {0,2}\d+ {0,2}[(-]?\d(?:[ \d]*\d)?[)-]? {0,2}\d+[/ -]?\d+[/ -]?\d+(?: *- *\d+)?`](https://regex101.com/r/opzBnV/1) – Wiktor Stribiżew Aug 30 '18 at 09:51
  • @WiktorStribiżew, it's a very broad one; it returns lots of unwanted matches – Shekhar Samanta Aug 30 '18 at 09:59
  • Try the pattern on this HTML: https://pastebin.com/WNxjLfhR – Shekhar Samanta Aug 30 '18 at 10:01
  • That is too long. Please provide exact specs for your pattern. – Wiktor Stribiżew Aug 30 '18 at 10:11
  • Probably, https://regex101.com/r/opzBnV/2 will be more precise. – Wiktor Stribiżew Aug 30 '18 at 10:16
  • What about using `sub` instead? i.e. `re.sub('[^+0-9\n]','',raw).split()` – Onyambu Aug 30 '18 at 10:17
  • @WiktorStribiżew, I am actually trying to get the phone numbers from job pages like this: https://de.indeed.com/viewjob?jk=1d06971a8e322ba2&tk=1cm4o74d7958gfs7&from=serp&vjs=3 . The phone numbers above are the formats that I get from this kind of job page – Shekhar Samanta Aug 30 '18 at 10:32
  • If you are not going to provide the specs, the question is off-topic/too broad, I suggest closing until you clear it all up. – Wiktor Stribiżew Aug 30 '18 at 10:33
  • I am testing with this pattern: regex101.com/r/opzBnV/2 @WiktorStribiżew, could you please post it as an answer and I will accept it soon – Shekhar Samanta Aug 30 '18 at 10:52
  • @WiktorStribiżew, yes, I upvoted. I am trying to learn this in more detail; could you please point me to some good online tutorials that explain the steps in detail – Shekhar Samanta Sep 01 '18 at 15:34
  • I do not know your level of regex knowledge, so I can only suggest doing all lessons at [regexone.com](http://regexone.com/), reading through [regular-expressions.info](http://www.regular-expressions.info), [regex SO tag description](http://stackoverflow.com/tags/regex/info) (with many other links to great online resources), and the community SO post called [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). Also, [rexegg.com](http://rexegg.com) is worth having a look at. – Wiktor Stribiżew Sep 01 '18 at 20:12
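
To illustrate the comment above about `\D` matching line breaks, here is a small sketch. The two-line `sample` string is taken from the question's list; the separator class `[-./ ]{0,3}` is only an assumed stand-in for the suggested `[-./]?`, chosen here so that spaces are still allowed between digit groups.

```python
import re

# Two adjacent lines from the question's list.
sample = "937 -539\n0211/409 2268"

# \D matches any non-digit, including "\n", so a match can run across lines.
loose = re.findall(r'\d{3}\D{0,3}\d{3}\D{0,3}\d{4}', sample)

# An explicit separator class (an assumption, not the final answer's pattern)
# excludes "\n", so every match stays on a single line.
strict = re.findall(r'\d{3}[-./ ]{0,3}\d{3}[-./ ]{0,3}\d{4}', sample)

print(loose)
print(strict)
```

The first list contains a match with an embedded `\n`, which is the same effect seen in the question's output; the second does not.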

1 Answer

I suggest using this pattern:

(?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)? *(?:[)-] *)?\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?

See the regex demo. Note that it is based on your comment saying that the phone numbers start with +49 or a 0, and on the list of examples you provided. Consider it a work in progress, since you have not provided more specific rules for phone number extraction.
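
A minimal sketch of how the pattern might be used from Python with `re.findall`. The names `PHONE_PATTERN` and `html_text` are made up for this illustration, and the HTML snippet is invented around three numbers from the question's list; real pages may well need further tuning.

```python
import re

# The suggested pattern, split across two raw strings for readability.
PHONE_PATTERN = re.compile(
    r'(?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)? *(?:[)-] *)?'
    r'\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?'
)

# Hypothetical HTML fragment; the numbers come from the question's list.
html_text = """<p>Tel.: +49 39291 55-217</p>
<div>Fax 0211/409 2268 <br> Mobil: +49(0)6105 937 -539</div>"""

# The pattern has no capturing groups, so findall returns the full matches.
phones = PHONE_PATTERN.findall(html_text)
print(phones)
```

Since the pattern only allows literal spaces (not `\s` or `\D`) between digit groups, a match cannot continue across a line break.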

Pattern details

  • `(?:\B\+ ?49|\b0)` - a +, optional space, 49 or a 0, both substrings cannot be preceded with a word char
  • `(?: *[(-]? *\d(?:[ \d]*\d)?)?` - an optional substring matching 0+ spaces, then an optional ( or -, 0+ spaces, a digit and then an optional sequence of digits/spaces followed with a digit
  • ` *(?:[)-] *)?` - 0+ spaces and then an optional sequence of ) or - followed with 0+ spaces
  • `\d+` - 1+ digits
  • ` *` - 0+ spaces
  • `(?:[/)-] *)?` - an optional sequence of /, ) or - followed with 0+ spaces
  • `\d+` - 1+ digits
  • ` *(?:[/)-] *)?` - 0+ spaces and then an optional sequence of /, ) or - followed with 0+ spaces
  • `\d+` - 1+ digits
  • `(?: *- *\d+)?` - an optional sequence: 0+ spaces, -, 0+ spaces, 1+ digits.
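
The breakdown above can also be kept next to the pattern itself by using Python's `re.VERBOSE` flag. This is only a sketch: `PHONE_RE` and `samples` are names invented here, and because verbose mode ignores unescaped whitespace, each literal space is written as `[ ]`.

```python
import re

PHONE_RE = re.compile(r"""
    (?:\B\+[ ]?49|\b0)                  # +49 (or + 49), or a leading 0
    (?:[ ]*[(-]?[ ]*\d(?:[ \d]*\d)?)?   # optional digits/spaces, maybe after ( or -
    [ ]*(?:[)-][ ]*)?                   # optional ) or - separator
    \d+[ ]*                             # 1+ digits, then 0+ spaces
    (?:[/)-][ ]*)?                      # optional /, ) or - separator
    \d+[ ]*                             # 1+ digits, then 0+ spaces
    (?:[/)-][ ]*)?                      # optional /, ) or - separator
    \d+                                 # 1+ digits
    (?:[ ]*-[ ]*\d+)?                   # optional trailing part such as ' - 0'
""", re.VERBOSE)

# A few formats from the question's list.
samples = [
    "+49 39291 55-217",
    "0211/409 2268",
    "+ 49-40-70 70 84 - 0",
    "+49 (5 41) 9 98 -2268",
]

for s in samples:
    print(s, bool(PHONE_RE.fullmatch(s)))
```

Writing the pattern this way makes it easier to adjust one part (for example, allowing `.` as a separator) once more precise extraction rules are known.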
Wiktor Stribiżew