Result is empty when scraping text using BeautifulSoup

Question

I am trying to scrape phone numbers from a web page using BeautifulSoup, but the results are empty when saved to a CSV file. The text format on the page is:

My code is:

phone = [d.find('a') for d in soup.find_all('div',{'class':'cbp-vm-cta'})]

What should I fix so the phone number can be scraped from the page?

Edited the question with the full code and URL for your reference. — swm, Jan 15 '20 at 11:47

KunduK · Accepted Answer · 2020-01-16T10:44:00.000

You need to find the span tag, read its data-content attribute, and then use a regular expression to extract the phone number.

import re
import requests
from bs4 import BeautifulSoup

raw = requests.get('https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=2').text
raw = raw.replace("</br>", "")

soup = BeautifulSoup(raw, 'html.parser')

name = soup.find_all('div', {'class' :'cbp-vm-companytext'})
phone = [re.findall(r'>(.*?)<', d.find('span')['data-content'])[0] for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]
addresses = [x.text.strip().split("\r\n")[-1].strip() for x in soup.find_all("div", class_='cbp-vm-address')]
print(phone)
#print(addresses)
#print(name)

Output:

['03-8922 0982', '018-651 9855', '012-931 2419', '03-5523 0664', '03-6057 1190', '03-6150 1314', '03-6150 4588', '03-40650044', '016-292 5956', '03-3250 6633', '03-7728 5339', '03-8063 6788']

Update

import re
import requests
from bs4 import BeautifulSoup

raw = requests.get('https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=3').text
raw = raw.replace("</br>", "")

soup = BeautifulSoup(raw, 'html.parser')
phone = [re.findall(r'>(.*?)<', d.find('span', attrs={"data-content": True})['data-content'])[0] for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]
print(phone)

Update: use a try/except block

import re
import requests
from bs4 import BeautifulSoup

raw = requests.get('https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=1').text
raw = raw.replace("</br>", "")

soup = BeautifulSoup(raw, 'html.parser')
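try:
    phone = [re.findall(r'>(.*?)<', d.find('span', attrs={"data-content": True})['data-content'])[0] for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]
    print(phone)
except (TypeError, IndexError):
    # TypeError: no span with data-content was found; IndexError: the regex matched nothing
    print("None")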
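The try/except above wraps the whole list comprehension, so a single listing without a phone number discards every result on the page. A minimal sketch of handling each listing separately follows, so that a missing phone becomes None for that entry only. The helper name extract_phone and the SAMPLE_HTML markup are assumptions made for illustration; the real page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the listing page; the real page may differ.
SAMPLE_HTML = """
<div class="cbp-vm-cta"><span data-content="&lt;a href='tel:0389220982'&gt;03-8922 0982&lt;/a&gt;"></span></div>
<div class="cbp-vm-cta"><span class="other"></span></div>
<div class="cbp-vm-cta"></div>
"""


def extract_phone(cta_div):
    """Return the phone text for one listing, or None if it has none."""
    # Only consider a span that actually carries a data-content attribute.
    span = cta_div.find('span', attrs={'data-content': True})
    if span is None:
        return None
    # Take the text between the first '>' and the next '<'.
    match = re.search(r'>(.*?)<', span['data-content'])
    return match.group(1) if match else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
phone = [extract_phone(d) for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]
print(phone)
```

With this sample markup the list keeps one entry per listing, which makes it easier to line phones up with names and addresses when writing rows to a CSV file.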

Do you know why it's not working when I test it with other pages? — swm, Jan 16 '20 at 03:40
@swm: Which page are you not getting values from? Can you share it? — KunduK, Jan 16 '20 at 09:24
It's for the other pages of the site you scraped. For example, https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=1 (and a few more pages). I couldn't work out why it happens with other random pages, as the format is the same on all of them. — swm, Jan 16 '20 at 09:34
@swm: The only difference between that page and the others is that it has one more span tag, which is why it fails there. I have updated the code. — KunduK, Jan 16 '20 at 09:51
Sorry, do you know how to add exception handling to the code, for example except: js = None? Some pages don't have a phone number, so the error appears when those pages are scraped. — swm, Jan 16 '20 at 10:35

score 0 · Answer 2 · answered Jan 15 '20 at 11:47

You'll probably be able to access the "data-content" attribute by selecting the span.

a_text = [d.find('span', {'class': 'left-border'})['data-content']
    for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]

After that, you'll need some string manipulation to find where the phone number starts and throw out the rest of the text, such as <a href=tel ..etc.
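The string manipulation step could look roughly like the sketch below, which uses only str.partition. The data_content value is a hypothetical example of what the attribute might contain, and the function name phone_from_data_content is made up for illustration.

```python
def phone_from_data_content(data_content):
    """Return the text between the first '>' and the next '<', or None."""
    # Everything after the first '>' (end of the opening <a ...> tag).
    after_open_tag = data_content.partition('>')[2]
    # Everything before the next '<' (start of the closing tag).
    number = after_open_tag.partition('<')[0].strip()
    return number or None


# Hypothetical attribute value; the real one may be formatted differently.
data_content = '<a href="tel:0389220982">03-8922 0982</a>'
phone_number = phone_from_data_content(data_content)
print(phone_number)
```

If the attribute holds no tag at all, partition finds no '>' and the function returns None instead of raising an error.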
