Trying to parse a phone number from HTML, but getting a lot of empty rows

Question

I can't work out how to extract a phone number from HTML with a regex. I checked my regex here and it works, so it should pick up the number from this link.

I am trying to parse it like this:

import requests
import re

url = 'https://a101.ru'
r = requests.get(url)
html = r.text
result = re.findall(r'((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}', html)
print(result)

And I get this:
[(u'', u'', u''), (u'', u'', u'').....(u'+7 ', u'+7', u'(495) ')....(u'', u'', u'')]

Required reading: https://stackoverflow.com/a/1732454/10553976 — Charles Landau, Feb 05 '19 at 14:57
Even if you could parse XML, would the element ` +7 (495) 221-40-21` provide one or two results? — wallyk, Feb 05 '19 at 15:04

Martin Evans · Accepted Answer · 2019-02-05T15:46:59.617


You could use a regex to spot the `tel:` part of the href:

import re
import requests

r = requests.get('https://a101.ru', verify=False)
print(re.findall(r'tel:(.*?)">', r.text))

For that page it would spot 4 matches:

['+7(495)221-40-21', '+7(495)221-40-21', '+7(495)221-40-21', '+7(495)221-40-21']
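
To try the idea without hitting the network, here is a minimal sketch that runs on a made-up HTML fragment. `SAMPLE_HTML` and `find_tel_links` are hypothetical names introduced for illustration, and the fragment is not copied from the real page. The pattern is a slightly stricter variant of the one above: `tel:(.*?)">` only stops at `">`, so when `href` is not the last attribute it also captures whatever follows the number, whereas `[^"]+` stops at the closing quote.

```python
import re

# Hypothetical HTML fragment, made up for illustration only.
SAMPLE_HTML = '''
<a class="phone" href="tel:+7(495)221-40-21">+7 (495) 221-40-21</a>
<a href="/contacts">Contacts</a>
<a href="tel:+7(495)221-40-21" class="footer-phone">Call us</a>
'''


def find_tel_links(html):
    """Return the number part of every href="tel:..." attribute in html."""
    # [^"]+ stops at the closing quote of the attribute value, so the match
    # stays clean even when other attributes follow the href.
    return re.findall(r'href="tel:([^"]+)"', html)


print(find_tel_links(SAMPLE_HTML))
# ['+7(495)221-40-21', '+7(495)221-40-21']
```

This still assumes double-quoted attribute values, so it stays a quick check rather than a general HTML parser.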

Normally I would use BeautifulSoup to parse the file correctly and extract the information, but for very specific minor uses, regex could be used with care.

You can obtain the same results with BeautifulSoup as follows:

from bs4 import BeautifulSoup
import requests
import re

r = requests.get('https://a101.ru', verify=False)
soup = BeautifulSoup(r.content, "html.parser")
print([tel['href'][4:] for tel in soup.find_all('a', href=re.compile(r'tel:'))])
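
As for the empty rows in the question: when a pattern contains capturing groups, `re.findall` returns a tuple of the groups for each match instead of the matched text, and optional groups that did not take part in the match come back as empty strings. The last part of the pattern, `[\d\- ]{7,10}`, also matches a plain run of spaces, which HTML is full of. A minimal sketch of this on made-up text follows; `SAMPLE_TEXT`, `NON_CAPTURING` and `find_phone_candidates` are hypothetical names, and the digit-count filter is only an assumption about what a sensible candidate looks like.

```python
import re

# Made-up text for illustration; the run of ten spaces mimics HTML indentation.
SAMPLE_TEXT = 'Call +7 (495) 221-40-21, or\n' + ' ' * 10 + 'write to us'

# Pattern from the question: three capturing groups, so findall returns 3-tuples.
CAPTURING = r'((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}'
# Same pattern with (?:...) groups: findall now returns the whole matched text.
NON_CAPTURING = r'(?:(?:8|\+7)[\- ]?)?(?:\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}'

print(re.findall(CAPTURING, SAMPLE_TEXT))
# [('+7 ', '+7', '(495) '), ('', '', '')]


def find_phone_candidates(text, min_digits=7):
    """Return whole matches that contain at least min_digits digits."""
    hits = (m.strip() for m in re.findall(NON_CAPTURING, text))
    # Drop matches that are only spaces or dashes (assumed not to be numbers).
    return [h for h in hits if sum(c.isdigit() for c in h) >= min_digits]


print(find_phone_candidates(SAMPLE_TEXT))
# ['+7 (495) 221-40-21']
```

Run against a whole page, this would probably still pick up other long digit runs such as IDs or timestamps, which is one more reason to prefer the `tel:` links when the site provides them.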

edited Feb 05 '19 at 15:46

answered Feb 05 '19 at 15:38


Can I use this regex ('tel:(.*?)">') on any HTML page to extract phone numbers? – 2manov Feb 05 '19 at 15:57

The `tel:` prefix is used by phones to actually dial the number, but not all websites add it. If it is used, you can be fairly certain that it is a valid number. – Martin Evans Feb 05 '19 at 15:59
