I'm trying to extract phone numbers from many email files. I wrote a regex to extract them, but it only returns results for one format.

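import re

# Raw string so the backslashes reach the regex engine unchanged.
PHONERX = re.compile(r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})")

# content holds the text of one email.
phonenumber = PHONERX.findall(content)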

When I reviewed the data, I found that the phone numbers come in many formats.

How can I extract all phone numbers in any of these formats:

800-569-0123
1-866-523-4176
(324)442-9843
(212) 332-1200
713/853-5620
713 853-0357
713 837 1749

This link is a sample of the dataset. The problem is that the regex sometimes extracts numbers from the Message-ID and other numbers in the email: https://www.dropbox.com/sh/pw2yfesim4ejncf/AADwdWpJJTuxaJTPfha38OdRa?dl=0

Ash

2 Answers

You don't need to list every possibility with a logical OR. You can use the following regex:

(?:\(\d+\)\s?\d*|\d+)([-\/ ]\d+){1,3}

To use it with re.findall(), make the group non-capturing:

(?:\(\d+\)\s?\d*|\d+)(?:[-\/ ]\d+){1,3}
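
The difference between the two patterns can be checked with a small sketch. The sample text is made up for illustration, using two of the formats from the question. With a capturing group, re.findall() returns only what the group captured on its last repetition; with the non-capturing group it returns the whole match.

```python
import re

# Hypothetical sample text built from two formats listed in the question.
text = "Call 713/853-5620 or (212) 332-1200."

capturing = r"(?:\(\d+\)\s?\d*|\d+)([-\/ ]\d+){1,3}"
non_capturing = r"(?:\(\d+\)\s?\d*|\d+)(?:[-\/ ]\d+){1,3}"

# With one capturing group, findall returns that group's last capture.
captured = re.findall(capturing, text)
# With no capturing groups, findall returns the full match.
whole = re.findall(non_capturing, text)

print(captured)  # ['-5620', '-1200']
print(whole)     # ['713/853-5620', '(212) 332-1200']
```

Note that this pattern accepts any run of digit groups joined by "-", "/" or a space, so it may also pick up dates and similar strings.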
Mazdak
  • I tried it but I didn't get the full phone number. The result I got looks like this: phonenumber=[' 14', '-7796', '-3490']) – Ash Apr 24 '17 at 05:30
  • @Ash That's because `re.findall` returns the contents of the capturing groups when the pattern has any. To get the whole match, make the group non-capturing by adding `?:`. Check out the update. – Mazdak Apr 24 '17 at 05:39
  • I just updated the question with the code I'm using. I'll try yours again, thank you – Ash Apr 24 '17 at 05:42
  • Your code looks perfect in the demo, but I don't know why it didn't work the same for me. I think it's because the dataset is many files of emails. Thank you – Ash Apr 24 '17 at 06:17

You may want to use:

\(?(?:1-)?\b[2-9][0-9]{2}\)?[-. \/]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b

This matches all of your examples and ignores false positives such as:

113 837 1749
222 2222 22222
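
As a quick check of that claim, here is a minimal sketch that runs the pattern over the seven formats from the question and the two false positives above. The variable names are illustrative only.

```python
import re

# Pattern copied from this answer, written as a raw string.
PHONE_RX = re.compile(
    r"\(?(?:1-)?\b[2-9][0-9]{2}\)?[-. \/]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b"
)

# The seven formats listed in the question.
wanted = [
    "800-569-0123",
    "1-866-523-4176",
    "(324)442-9843",
    "(212) 332-1200",
    "713/853-5620",
    "713 853-0357",
    "713 837 1749",
]
# The two false positives named in this answer.
unwanted = ["113 837 1749", "222 2222 22222"]

# Search one block of text, one number per line, as findall would on an email.
found = PHONE_RX.findall("\n".join(wanted))
# Collect any unwanted string in which the pattern still finds a match.
leaked = [s for s in unwanted if PHONE_RX.search(s)]

print(found == wanted)  # True
print(leaked)           # []
```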

Pedro Lobito
  • Can you please define *no result*? Both demos work as expected. Did you get any errors? – Pedro Lobito Apr 24 '17 at 05:28
  • Where should I use re.DOTALL | re.MULTILINE? PHONERX = re.compile("\(?(?:1-)?\b[2-9][0-9]{2}\)?[-. /]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b") phonenumber = re.findall(PHONERX,content, re.DOTALL | re.MULTILINE) – Ash Apr 24 '17 at 05:39
  • You can use the code from https://ideone.com/A8RQcC, it works as intended. – Pedro Lobito Apr 24 '17 at 05:44
  • Thank you for your answer. It worked perfectly in your example but it didn't work for me. I don't know why. I think it's because I'm extracting from many files of emails – Ash Apr 24 '17 at 06:03
  • I may know what's going on. Can you post a sample of the email's source? Please also include the headers. On which telephone format did your regex work? – Pedro Lobito Apr 24 '17 at 10:35
  • OK, I'll update it now. My old regex also extracts numbers from the Message-ID in the header – Ash Apr 26 '17 at 03:20
  • I just updated the question and added the sample in the link – Ash Apr 26 '17 at 03:25
  • I've tested my regex with the new email source and it works as expected, it only extracts the phone #'s. – Pedro Lobito Apr 26 '17 at 08:21
  • Yeah, I tested it in the regex demo and it worked, but it didn't work for me when I tried it with all the data. I just updated the link with more than 5 email files, if you can check whether it works with more than one email. Thanks a lot – Ash Apr 26 '17 at 10:17
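
The thread never confirms why the pattern failed on the full dataset, but the snippet posted in the comments has two details worth checking. It writes the pattern as an ordinary string literal, where Python turns "\b" into a backspace character instead of a regex word boundary, and it passes flags to re.findall() together with an already compiled pattern. The sketch below illustrates both on a made-up line of text. Treat it as one possible explanation, not a diagnosis.

```python
import re

# Hypothetical one-line email body using a format from the question.
content = "Reach me at 713 853-0357."

# Ordinary literal: "\b" is a backspace character (0x08), not a word boundary.
# The parentheses are double-escaped only to keep this literal free of warnings.
plain = "\\(?(?:1-)?\b[2-9][0-9]{2}\\)?[-. /]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b"
# Raw literal: every backslash reaches the regex engine unchanged.
raw = r"\(?(?:1-)?\b[2-9][0-9]{2}\)?[-. /]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b"

plain_hits = re.findall(plain, content)
raw_hits = re.findall(raw, content)
print(plain_hits)  # []
print(raw_hits)    # ['713 853-0357']

# Flags belong in re.compile(); combining them with a compiled pattern fails.
compiled = re.compile(raw)
try:
    re.findall(compiled, content, re.DOTALL | re.MULTILINE)
    flags_error = None
except ValueError as exc:
    flags_error = str(exc)
print(flags_error)
```

The pattern contains no `.`, `^` or `$` outside a character class, so re.DOTALL and re.MULTILINE would not change its matches in any case.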