1

I tried to scrape the phone numbers from a webpage using BeautifulSoup. However, the results come out empty when saved to a CSV file. The text format on the page is: (image not available)

My code is:

phone = [d.find('a') for d in soup.find_all('div',{'class':'cbp-vm-cta'})]

What should I fix so the phone number can be scraped from the page?

swm

2 Answers

2

You need to find the span tag, read its data-content attribute, and then use a regular expression to extract the phone number.

import re
import requests
from bs4 import BeautifulSoup

# Fetch the listing page and drop stray </br> tags before parsing.
raw = requests.get('https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=2').text
raw = raw.replace("</br>", "")

soup = BeautifulSoup(raw, 'html.parser')

name = soup.find_all('div', {'class': 'cbp-vm-companytext'})
# The phone number is the text between '>' and '<' inside the span's data-content attribute.
phone = [re.findall(r'>(.*?)<', d.find('span')['data-content'])[0] for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]
# The address is the last line of each address block.
addresses = [x.text.strip().split("\r\n")[-1].strip() for x in soup.find_all("div", class_='cbp-vm-address')]
print(phone)
#print(addresses)
#print(name)

Output:

['03-8922 0982', '018-651 9855', '012-931 2419', '03-5523 0664', '03-6057 1190', '03-6150 1314', '03-6150 4588', '03-40650044', '016-292 5956', '03-3250 6633', '03-7728 5339', '03-8063 6788']

Update: some pages contain an extra span tag, so select only the span that has a data-content attribute.


import re
import requests
from bs4 import BeautifulSoup

raw = requests.get('https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=3').text
raw = raw.replace("</br>", "")

soup = BeautifulSoup(raw, 'html.parser')
phone = [re.findall(r'>(.*?)<', d.find('span', attrs={"data-content": True})['data-content'])[0] for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]
print(phone)

Update: use a try/except block to handle pages that have no phone number.


import re
import requests
from bs4 import BeautifulSoup

raw = requests.get('https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=1').text
raw = raw.replace("</br>", "")

soup = BeautifulSoup(raw, 'html.parser')
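try:
    phone = [re.findall(r'>(.*?)<', d.find('span', attrs={"data-content": True})['data-content'])[0] for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]
    print(phone)
except (TypeError, IndexError):
    # TypeError: no matching span was found; IndexError: the regex found no match.
    print("None")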
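Note that a single try block around the whole list comprehension discards every number on the page as soon as one listing has no phone. A minimal sketch of per-listing handling follows, so that a missing number becomes None for that listing only. The sample markup and the helper names extract_phone and extract_phones are assumptions made for illustration; the real page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the description in this thread:
# the second listing has a span but no data-content attribute.
SAMPLE_HTML = """
<div class="cbp-vm-cta">
  <span data-content="&lt;a href='tel:0389220982'&gt;03-8922 0982&lt;/a&gt;"></span>
</div>
<div class="cbp-vm-cta">
  <span class="other"></span>
</div>
"""


def extract_phone(cta_div):
    """Return the phone number for one listing, or None if it has none."""
    span = cta_div.find("span", attrs={"data-content": True})
    if span is None:
        return None
    # The number is the text between '>' and '<' in the attribute value.
    match = re.search(r">([^<]+)<", span["data-content"])
    return match.group(1).strip() if match else None


def extract_phones(html):
    """Return one entry per listing, keeping None where no phone exists."""
    soup = BeautifulSoup(html, "html.parser")
    return [extract_phone(d) for d in soup.find_all("div", {"class": "cbp-vm-cta"})]


print(extract_phones(SAMPLE_HTML))
```

Keeping one entry per listing also keeps the phone list aligned with the name and address lists when the rows are written to a CSV file.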
KunduK
  • Do you know why it's not working when I test it with other pages? – swm Jan 16 '20 at 03:40
  • @swm: Which page are you not getting values from? Can you share that page? – KunduK Jan 16 '20 at 09:24
  • It happens on other pages of the same site you scraped, for example https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=1 (and a few more pages). I couldn't find out why it happens when testing other pages, as the format is the same on all of them. – swm Jan 16 '20 at 09:34
  • @swm: The only difference between that page and the others is that it has one more span tag, which is why it fails for that page. I have updated the code. – KunduK Jan 16 '20 at 09:51
  • Sorry, do you know how to add exception handling to the code, for example except: js = None? Some pages don't have a phone number, so an error appears when scraping those pages. – swm Jan 16 '20 at 10:35
  • @swm: Use a try..except block. – KunduK Jan 16 '20 at 10:42
0

You'll probably be able to access the "data-content" attribute by selecting the span.

a_text = [d.find('span', {'class': 'left-border'})['data-content']
    for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]

After that, you'll have to find where the phone number starts using some string manipulation and throw out the rest of the text, such as <a href=tel ..etc.
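The string manipulation could look something like the sketch below. It assumes the data-content value is a small HTML fragment whose visible text is the phone number, for example an anchor tag wrapping the number. The function name phone_from_data_content and the sample string are made up for illustration.

```python
def phone_from_data_content(data_content):
    """Return the text between the first '>' and the next '<', or None."""
    # Drop everything up to and including the end of the opening tag.
    _, found_open, rest = data_content.partition(">")
    if not found_open:
        return None
    # Keep only what comes before the start of the closing tag.
    text, found_close, _ = rest.partition("<")
    if not found_close:
        return None
    text = text.strip()
    return text or None


# Hypothetical attribute value, assumed to resemble what the page contains.
sample = "<a href='tel:0389220982'>03-8922 0982</a>"
print(phone_from_data_content(sample))
```

Returning None for values without any markup lets the caller skip or flag listings where the attribute holds something unexpected.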

Ollie in PGH