Can't extract all the information I need

Question

I'm trying to get the cell phone and office phone numbers from this website: https://www.zillow.com/lender-profile/DougShoemaker/

I've tried playing around with bs4, but I can only get the first phone number. I'm trying to get both the office and cell numbers.

from selenium import webdriver
from bs4 import BeautifulSoup
import time


#Chrome webdriver filepath...Chromedriver version 74
driver = webdriver.Chrome(r'C:\Users\mfoytlin\Desktop\chromedriver.exe')
driver.get('https://www.zillow.com/lender-profile/DougShoemaker/')
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(2)
phoneNum = driver.find_element_by_class_name('zsg-list_definition')
trial = phoneNum.find_element_by_class_name('zsg-sm-hide')
print(trial.text)

What problem are you having? What happens when you try to get the office and cell numbers? Do you get an incorrect result, an empty result, or an error message? — John Gordon, Jul 01 '19 at 19:43
I get the correct result for the first (cell) phone number, but I can't figure out the right paths or searches to get both of the phone numbers provided. The code above successfully finds and prints the first phone number, but I can't get past that. The way the tags are set up makes it tricky to get the desired information. @John Gordon — mcfoyt, Jul 01 '19 at 19:48

abdusco · Answer 1 · 2019-07-01T20:03:34.247

You don't have to use Selenium, or even BeautifulSoup. If you inspect the network requests in Developer Tools (F12) > Network, you can see that the data is fetched using an XHR request.

You can make this request yourself and use the JSON response any way you like.

POST https://mortgageapi.zillow.com/getRegisteredLender?partnerId=RD-CZMBMCZ
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0
Referer: https://www.zillow.com/lender-profile/DougShoemaker/
Content-Type: application/json

{
  "fields": [
    "aboutMe",
    "address",
    "cellPhone",
    # ... other fields
    "website"
  ],
  "lenderRef": {
    "screenName": "DougShoemaker"
  }
}

With the requests library, you can try:

import requests

if __name__ == '__main__':
    payload = {
        "fields": [
            "screenName",
            "cellPhone",
            "officePhone",
            "title",
        ],
        "lenderRef": {
            "screenName": "DougShoemaker"
        }
    }

    res = requests.post('https://mortgageapi.zillow.com/getRegisteredLender?partnerId=RD-CZMBMCZ',
                        json=payload)
    res.raise_for_status()
    data = res.json()

    cellphone, office_phone = data['lender']['cellPhone'], data['lender']['officePhone']
    cellphone_num = '({areaCode}) {prefix}-{number}'.format(**cellphone)
    office_phone_num = '({areaCode}) {prefix}-{number}'.format(**office_phone)
    print(office_phone_num, cellphone_num)

which prints:

(618) 619-4120 (618) 795-0790
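
If a lender has not filled in one of the numbers, the formatting step above would presumably fail on a missing key. A minimal sketch of a more defensive version follows. It assumes the response keeps the shape used above (a 'lender' object whose phone entries have 'areaCode', 'prefix' and 'number'). The sample dict is hypothetical stand-in data built from the numbers printed above, and 'faxPhone' is an assumed field name that I have not checked against the API.

```python
def format_phone(phone):
    """Format a phone dict as '(AAA) PPP-NNNN', or return None if it is missing."""
    if not phone:
        return None
    return '({areaCode}) {prefix}-{number}'.format(**phone)


def extract_phones(data, keys=('officePhone', 'cellPhone', 'faxPhone')):
    """Map each requested key to a formatted number (None when absent).

    'faxPhone' is an assumed field name, used here only for illustration.
    """
    lender = data.get('lender', {})
    return {key: format_phone(lender.get(key)) for key in keys}


# Hypothetical stand-in for res.json(), shaped like the response used above.
sample = {
    'lender': {
        'cellPhone': {'areaCode': '618', 'prefix': '795', 'number': '0790'},
        'officePhone': {'areaCode': '618', 'prefix': '619', 'number': '4120'},
    }
}

phones = extract_phones(sample)
print(phones)
```

With a real response, you would pass data from res.json() instead of sample.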

score 0 · Accepted Answer · answered Jul 01 '19 at 19:53

Try the following XPath expressions, one for each phone number:

Office Phone:
//dt[contains(text(),'Office')]/following-sibling::dd/div/span
Cell Phone:
//dt[contains(text(),'Cell')]/following-sibling::dd/div/span
Fax Number:
//dt[contains(text(),'Fax')]/following-sibling::dd/div/span
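
These expressions rely on each label sitting in a dt element, with the number inside the dd that follows it. As a rough illustration of that label-to-value pairing, here is a sketch that uses the standard library's html.parser instead of XPath. The HTML fragment is a hypothetical simplification inferred from the XPaths and the class name in the question, not the page's actual markup.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect {dt text: dd text} pairs from definition-list markup."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None    # 'dt' or 'dd' while we are inside one
        self._label = None  # text of the most recent <dt>
        self._buf = []      # text fragments of the current dt/dd

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # nested tags such as <div>/<span> are ignored
        text = ''.join(self._buf).strip()
        if tag == 'dt':
            self._label = text
        elif self._label is not None:
            self.pairs[self._label] = text
            self._label = None
        self._tag = None


# Hypothetical fragment; the real page may be structured differently.
SAMPLE_HTML = """
<dl class="zsg-list_definition">
  <dt>Office</dt><dd><div><span>(618) 619-4120</span></div></dd>
  <dt>Cell</dt><dd><div><span>(618) 795-0790</span></div></dd>
</dl>
"""

parser = DefinitionListParser()
parser.feed(SAMPLE_HTML)
numbers = parser.pairs
print(numbers)
```

In practice, you would feed it driver.page_source after the page has finished rendering.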

score 0 · Answer 3 · answered Jul 01 '19 at 19:57

To extract the Office, Cell and Fax numbers, wait for each element to become visible by using WebDriverWait with visibility_of_element_located(), and locate it with an XPath expression such as the ones below:

Code Block:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('start-maximized')
# options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')
# 'options' replaces the deprecated 'chrome_options' keyword
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.zillow.com/lender-profile/DougShoemaker/')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//dt[text()='Office']//following::dd[1]//span"))).get_attribute("innerHTML"))
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//dt[text()='Cell']//following::dd[1]//span"))).get_attribute("innerHTML"))
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//dt[text()='Fax']//following::dd[1]//span"))).get_attribute("innerHTML"))

Console Output:

(618) 619-4120
(618) 795-0790
(618) 619-4120
