Find USA phone numbers in python script

Question

the following python script allows me to scrape email addresses from a given file using regular expressions.

How could I add to this so that I can also get phone numbers? Say, if it was either the 7 digit or 10 digit (with area code), and also account for parenthesis?

My current script can be found below:

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
        data = open(filename,'r')
        bulkemails = data.read()
else:
        print "File not found."
        raise SystemExit

# regex = something@whatever.xxx
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
        emails += str(x)+"\n"

# function to write file
def writefile():
        f = open(newfilename, 'w')
        f.write(emails)
        f.close()
        print "File written."

Regex for phone numbers:

(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})

Another regex for phone numbers:

(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?

I just added to my post what I have for phone numbers. Having difficult detecting 7 or 10 digit numbers that don't have a hyphens. — Aaron, Oct 06 '10 at 01:26
Just one "country"/system or world-wide? Do you need to distinguish between cell/mobile and landline? Do you need to distinguish special-purpose numbers like 800 numbers? Possible +<"country" code> prefix? — John Machin, Oct 06 '10 at 02:04
I was hoping to keep it relatively simple. So not worry about the country code. It should be able to accept area codes with or without the parenthesis. Or just plain 7 digit numbers too. There doesn't need to be a distinguish between numbers like 800 numbers. — Aaron, Oct 06 '10 at 02:08
possible duplicate of [A comprehensive regex for phone number validation](http://stackoverflow.com/questions/123559/a-comprehensive-regex-for-phone-number-validation) — Daniel Vandersluis, Oct 06 '10 at 02:56
You're right, it looks like someone posted something that should work. is there anyway that someone could describe how this works? I've posted it below the initial one above. How would I be able to add this to my regex for email addresses? Thanks again for the help — Aaron, Oct 06 '10 at 03:23
Like this: http://stackoverflow.com/questions/3302482/i-need-a-regular-expression-to-convert-us-tel-number-to-link — BoltBait, Oct 06 '10 at 00:38
Brilliant. I'm working on an application and it needs to prevent people from sharing their phone numbers with each other. This works on a lot of different cases where users may try to be sneaky. Thank you for this. — Andrew Rout, Feb 23 '22 at 22:56

Auguste · Accepted Answer · 2010-10-06T03:00:00.950

66

If you are interested in learning Regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal allow you to enter some test data, then write and test a Regular Expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc), grab a Regex cheatsheet and see how far you can get. If nothing else, it will help in reading other peoples Expressions.

Edit: Here is a modified version of your Regex, which should also match 7 and 10-digit phone numbers that lack any hyphens, spaces or dots. I added question marks after the character classes (the []s), which makes anything within them optional. I tested it in RegexPal, but as I'm still learning Regex, I'm not sure that it's perfect. Give it a try.

(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})

It matched the following values in RegexPal:

000-000-0000
000 000 0000
000.000.0000

(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000

000-0000
000 0000
000.0000

0000000
0000000000
(000)0000000

edited Oct 06 '10 at 03:00

answered Oct 06 '10 at 01:04

Auguste

973
12
12

Thanks, I have found the RegexPal rather helpful. I added to my post and included what I have so far for phone numbers. Something I'm having difficulty doing is detecting 7 or 10 digit numbers that don't have any hyphens at all. – Aaron Oct 06 '10 at 01:25
@Aaron, I took a shot at modifying the Regex you gave to solve your problem. It's included in my answer, which I edited. Give it a try and see if it works. – Auguste Oct 06 '10 at 02:54
This looks really great. I just did some testing and appears to be working very well. My only question, how do implement this so that it can work with my existing email addresses? Is there a way to do this that isn't too much work? Thanks again – Aaron Oct 06 '10 at 03:49
You should be able to implement it similarly to the way the email Regex is already implemented; Try modifying a copy of the `# regex = something@whatever.xxx` block. I'd take a shot, but I've never touched Python before. You simply want to search for matches to the phone number Regex within the file you open, and output them to `result.txt`. – Auguste Oct 06 '10 at 04:22
Hey, what's the status on your project? Did you get it implemented? – Auguste Oct 10 '10 at 05:09
it matches non-latin characters ٠٧٧٠١٣٥٥٥٥ – Frank Jul 30 '16 at 23:51
Use flag re.A if you are using python3 and want to avoid non-latin unicode – Frank Jul 31 '16 at 00:02
Given this is the first search result on Google, I think it's necessary to point out that there's now a [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) module that can do this. – PLPeeters Feb 15 '17 at 14:57
@Auguste how would I match emoji characters between numbers? say i have 123emoji345emoji4444, and would like to get 1233454444 – sql-noob Jul 28 '17 at 17:10
I added additional condition for extra spaces around dashes. `r'(\(\d{3}\)\s*[-\.\s]\s*\d{3}\s*[-\.\s]??\s*\d{4}|\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'` – Rami Alloush Nov 20 '20 at 21:01

dotancohen · Answer 2 · 2016-06-28T06:02:13.497

This is the process of building a phone number scraping regex.

First, we need to match an area code (3 digits), a trunk (3 digits), and an extension (4 digits):

reg = re.compile("\d{3}\d{3}\d{4}")

Now, we want to capture the matched phone number, so we add parenthesis around the parts that we're interested in capturing (all of it):

reg = re.compile("(\d{3}\d{3}\d{4})")

The area code, trunk, and extension might be separated by up to 3 characters that are not digits (such as the case when spaces are used along with the hyphen/dot delimiter):

reg = re.compile("(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")

Now, the phone number might actually start with a ( character (if the area code is enclosed in parentheses):

reg = re.compile("(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")

Now that whole phone number is likely embedded in a bunch of other text:

reg = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")

Now, that other text might include newlines:

reg = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)

Enjoy!

I personally stop here, but if you really want to be sure that only spaces, hyphens, and dots are used as delimiters then you could try the following (untested):

reg = re.compile(".*?(\(?\d{3})? ?[\.-]? ?\d{3} ?[\.-]? ?\d{4}).*?", re.S)

score 8 · Answer 3 · edited Mar 04 '15 at 20:24

8

I think this regex is very simple for parsing phone numbers

re.findall("[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)

edited Mar 04 '15 at 20:24

Andy

49,085
60
166
233

answered Mar 04 '15 at 20:18

user4959

455
1
5
8

Ali Naderi · Answer 4 · 2022-05-03T13:11:55.543

Below is completion of the answers above. This regex is also able to detect country code:

((?:\+\d{2}[-\.\s]??|\d{4}[-\.\s]??)?(?:\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}))

It can detect the samples below:

000-000-0000
000 000 0000
000.000.0000

(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000

000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000

# Detect phone numbers with country code
+00 000 000 0000
+00.000.000.0000
+00-000-000-0000
+000000000000
0000 0000000000
0000-000-000-0000
00000000000000
+00 (000)000 0000
0000 (000)000-0000
0000(000)000-0000

Updated as of 03.05.2022:

I fixed some issues in the phone numbers detection regex above, you find it in the link below. Complete the regex to include more country codes.

https://regex101.com/r/6Qcrk1/1

score 3 · Answer 5 · answered Nov 10 '16 at 14:33

3

For spanish phone numbers I use this with quite success:

re.findall( r'[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}',str)

answered Nov 10 '16 at 14:33

Alex Moleiro

1,166
11
13

Th3Tr1ckst3r · Answer 6 · 2018-10-13T02:51:17.573

Since nobody has posted this regex yet, I will. This is what I use to find phone numbers. It matches all regular phone number formats you see in the United States. I did not need this regex to match international numbers so I didn't make adjustments to regex for that purpose.

phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"

Use this pattern if you want simple phone numbers with no characters in between to match. An example of this would be: "4441234567".

phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"

J. Doe · Answer 7 · 2018-09-26T10:19:16.833

0

You can check : http://regex.inginf.units.it/. With some training data and target, it constructs you an appropriate regex. It is not always perfect (check F-score). Let's try it with 15 examples :

re.findall("\w\d \w\w \w\w \w\w \w\d|(?<=[^\d][^_][^_] )[^_]\d[^ ]\d[^ ][^ ]+|(?<= [^<]\w\w \w\w[^:]\w[^_][^ ][^,][^_] )(?: *[^<]\d+)+",  
           """Lorem ipsum ©  04-42-00-00-00 dolor 1901 sit amet, consectetur +33 (0)4 42 00 00 00 adipisicing elit. 2016 Sapiente dicta fugit fugiat hic 04 42 00 00 00 aliquam itaque 04.42.00.00.00 facere, 13205 number: 100 000 000 00013 soluta. 4 Totam id dolores!""")

returns ['04 42 00 00 00', '04.42.00.00.00', '04-42-00-00-00', '50498,'] add more examples to gain precision

edited Sep 26 '18 at 10:19

answered Nov 23 '17 at 11:53

J. Doe

3,458
2
24
42

fails on simple "4441234567" number – Sarang Sep 26 '18 at 08:16
The solution would be better with more examples (15 is too short) – J. Doe Sep 26 '18 at 08:47

score 0 · Answer 8 · answered Sep 29 '21 at 23:39

 //search phone number using regex in python

 //form the regex according to your output


 // with this you can get single mobile number



phoneRegex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")

Mobile = phoneRegex.search("my number is 123-456-6789")

print(Mobile.group())

Output: 123-456-6789


phoneRegex1 = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")

Mobile1 = phoneRegex1.search("my number is 123-456-6789")

print(Mobile1.group())

Output: 123-456-789


Mobile1 = phoneRegex1.search("my number is 456-6789")

print(Mobile1.group())

Output: 456-678

And what happens when there is no match? This: `AttributeError: 'NoneType' object has no attribute 'group'` Rending the code all but useless. — DataMinion, May 12 '22 at 20:56

score 0 · Answer 9 · edited Jan 08 '22 at 21:39

0

While these are simple solutions they are all incorrect for North America. The problem lies in the fact that area-code and exchange numbers cannot start with a zero or a one.

r"(\\(?[2-9]\d{2}\\)?[ -])?[2-9]\d{2}-\d{4}"

would be the correct way to parse a 7 or 10-digit phone number.
(202) 555-4111
(202)-555-4111
202-555-4111
555-4111
will all parse correctly.

edited Jan 08 '22 at 21:39

buddemat

4,552
14
29
49

answered Jan 08 '22 at 19:49

Robert Bruce Nye

1
1

score 0 · Answer 10 · answered Feb 11 '23 at 16:20

0

Use this code to find the number like "416-676-4560"

doc=browser.page_source
    phones=re.findall(r'[\d]{3}-[\d]{3}-[\d]{4}',doc)

answered Feb 11 '23 at 16:20

Muhammad Qaiser Abas Khan Sial

100
6

Find USA phone numbers in python script

10 Answers10

Linked

Related