I am trying to use a regular expression to extract phone numbers from web pages. The problem I am facing is that it also picks up unwanted IDs and other numeric values from the page markup. If anyone can suggest improvements, it would be really helpful. Below are the code and the regular expression I am using in Python:
import re
from urllib2 import urlopen as uReq

url = "http://hunterhawk.com/"  # one of the URLs tested below
uClient = uReq(url)
page_html = uClient.read()  # raw HTML of the page
print re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)
For most websites, the script returns values from unrelated page elements, and only sometimes the actual phone numbers. Please suggest modifications to the expression:
re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)
My output for different URLs looks like this:
http://www.fraitagengineering.com/index.html
['(877) 424-4752']
http://hunterhawk.com/
['1481240672', '1481240643', '1479852632', '1478013441', '1481054486', '1481054560', '1481054598', '1481054588', '1476820246', '1481054521', '1481054540', '1476819829', '1481240830', '1479855986', '1479855990', '1479855994', '1479855895', '1476819760', '1476741750', '1476741750', '1476820517', '1479862863', '1476982247', '1481058326', '1481240672', '1481240830', '1513106590', '1481240643', '1479855986', '1479855990', '1479855994', '1479855895', '1479852632', '1478013441', '1715282331', '1041873852', '1736722557', '1525761106', '1481054486', '1476819760', '1481054560', '1476741750', '1481054598', '1476741750', '1481054588', '1476820246', '1481054521', '1476820517', '1479862863', '1481054540', '1476982247', '1476819829', '1481058326', '(925) 798-4950', '2093796260']
http://www.lbjewelrydesign.com/
['213-629-1823', '213-629-1823']
I want just phone numbers in the (000) 000-0000 (note that I have added a space after the parenthesis), (000)-000-0000, or 000-000-0000 format. Any suggestions are appreciated. Please note that I have already referred to this link: Find phone numbers in python script
I need an improved regex for my specific needs.
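To make the target concrete, here is a rough sketch of a stricter pattern that accepts only the three formats listed above and nothing else. The function name `find_phone_numbers` and the sample string are made up for illustration, and I have not checked the pattern against the real pages, so treat it as a starting point rather than a finished solution. It assumes the area code is either fully parenthesized and followed by a space or hyphen, or bare and followed by a hyphen, and it uses lookarounds so that a match cannot be part of a longer run of digits.

```python
import re

# Hypothetical stricter pattern: only (000) 000-0000, (000)-000-0000, 000-000-0000.
PHONE_RE = re.compile(
    r"(?<!\d)"                       # not preceded by another digit
    r"(?:\(\d{3}\)[ -]|\d{3}-)"      # "(000) ", "(000)-" or "000-"
    r"\d{3}-\d{4}"                   # "000-0000"
    r"(?!\d)"                        # not followed by another digit
)

def find_phone_numbers(text):
    """Return all substrings of text that match one of the three formats."""
    return PHONE_RE.findall(text)

# Made-up sample mixing bare 10-digit IDs with real-looking numbers.
sample = ("img_1481240672.jpg Call (925) 798-4950, (877)-424-4752 "
          "or 213-629-1823; ref 2093796260")
print(find_phone_numbers(sample))
```

Because the separators are required, bare 10-digit values such as 1481240672 no longer match, which is the main difference from the `\D{0,3}` version.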