I am trying to use a regular expression to extract phone numbers from web pages. The problem is that it also picks up unwanted ids and other elements of the page. If anyone can suggest improvements, it would be really helpful. Below are the code and the regular expression I am using in Python:

import re
from urllib2 import urlopen as uReq

# url is assumed to hold the address of the page to scan
uClient = uReq(url)
page_html = uClient.read()
print re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", page_html)

For most websites the script also returns values from page elements, and only sometimes gives accurate results. Please suggest modifications to this expression:

re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)

My output for different URLs looks like this:

http://www.fraitagengineering.com/index.html
['(877) 424-4752']
http://hunterhawk.com/
['1481240672', '1481240643', '1479852632', '1478013441', '1481054486', '1481054560', '1481054598', '1481054588', '1476820246', '1481054521', '1481054540', '1476819829', '1481240830', '1479855986', '1479855990', '1479855994', '1479855895', '1476819760', '1476741750', '1476741750', '1476820517', '1479862863', '1476982247', '1481058326', '1481240672', '1481240830', '1513106590', '1481240643', '1479855986', '1479855990', '1479855994', '1479855895', '1479852632', '1478013441', '1715282331', '1041873852', '1736722557', '1525761106', '1481054486', '1476819760', '1481054560', '1476741750', '1481054598', '1476741750', '1481054588', '1476820246', '1481054521', '1476820517', '1479862863', '1481054540', '1476982247', '1476819829', '1481058326', '(925) 798-4950', '2093796260']
http://www.lbjewelrydesign.com/
['213-629-1823', '213-629-1823']

I want only phone numbers in the formats (000) 000-0000 (note that I have added a space after the closing parenthesis), (000)-000-0000, or 000-000-0000. Any suggestions are appreciated. Please note that I have already referred to this link: Find phone numbers in python script

I need to improve the regex for my specific needs.

D-hash-pirit
This answer might be helpful: https://stackoverflow.com/questions/3868753/find-phone-numbers-in-python-script – jakevdp Dec 12 '17 at 21:44

2 Answers

You can avoid matching ids, other attributes, or any HTML markup at all by searching only the plain text of the web page. To get that text, pass the page content through the BeautifulSoup HTML parser:

from urllib2 import urlopen as uReq

from bs4 import BeautifulSoup

page_text = BeautifulSoup(uReq(url), "html.parser").get_text()
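
To see why searching only the text helps, here is a small self-contained sketch. It is written for Python 3 and uses the standard library's html.parser as a stand-in for BeautifulSoup's get_text(), so that it runs without extra packages. The TextOnly class and the sample_html string are made up for illustration; the pattern is the one from the question.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Made-up page fragment: the id contains a 10-digit number.
sample_html = '<div id="block-1481240672"><p>Call us: (925) 798-4950</p></div>'

# The pattern from the question, unchanged.
pattern = re.compile(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")

parser = TextOnly()
parser.feed(sample_html)
parser.close()
plain_text = "".join(parser.chunks)

raw_hits = pattern.findall(sample_html)   # searches markup and attributes too
text_hits = pattern.findall(plain_text)   # searches visible text only

print(raw_hits)
print(text_hits)
```

On the raw markup the id value is reported as a phone number; on the extracted text only the real number remains. Long digit runs that appear in the visible text would still match, so a stricter expression is still worth having.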

Then, as Jake mentioned in comments, you can make your regular expression more reliable:

alecxe

The following regular expression can be used to match the samples that you presented and other similar numbers:

(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}
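
One thing to watch for when plugging this expression into re.findall, as the question does: because the pattern contains a capturing group, findall returns only the text of that group, not the whole number. The sketch below, written for Python 3 with a made-up sample string, shows two possible ways around that: making the group non-capturing with (?:...), or using finditer and taking group(0). The variable names are illustrative only.

```python
import re

# Made-up sample text for illustration.
text = "Office: (877) 424-4752, fax: 213-629-1823"

# The expression exactly as given above, with its capturing group.
capturing = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

# Same expression with the group made non-capturing via (?:...).
non_capturing = re.compile(r"(?:\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

# findall returns the group's text when the pattern has one group.
prefixes_only = capturing.findall(text)

# With no capturing groups, findall returns the whole matches.
full_numbers = non_capturing.findall(text)

# Alternative: keep the group and read the whole match from each match object.
via_finditer = [m.group(0) for m in capturing.finditer(text)]

print(prefixes_only)
print(full_numbers)
print(via_finditer)
```

Either variant should give the complete numbers; which one to use is mostly a matter of whether the area-code prefix is needed separately.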

The following example script can be used to test positive and negative cases and to experiment with the regular expression:

import re

positiveExamples = [
    '(000) 000-0000',
    '(000)-000-0000',
    '(000)000-0000',
    '000-000-0000'
]
negativeExamples = [
    '000 000-0000',
    '000-000 0000',
    '000 000 0000',
    '000000-0000',
    '000-0000000',
    '0000000000'
]

reObj = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

for example in positiveExamples:
    print('Asserting positive example: %s' % example)
    assert reObj.match(example)

for example in negativeExamples:
    print('Asserting negative example: %s' % example)
    assert reObj.match(example) is None
Eduardo