
I need to get a data string from Yahoo Finance. However, the relevant information is "hidden" under a breakdown list.

As you can see, I can access other data, e.g. Total Revenue and Cost of Revenue. The problem occurs when I try to access data hidden under the breakdown list: Current Assets and Inventory (which sit under the Total Assets and Current Assets sections, respectively).

Python raises an `AttributeError: 'NoneType' object has no attribute 'find_next'` error, which I do not find illustrative.

P.S. I found that these elements are the problem by commenting out each line.

import urllib.request as url
from bs4 import BeautifulSoup

company = input('enter company abbreviation: ')

# fetch the income statement and balance sheet pages and parse them
income_page = 'https://finance.yahoo.com/quote/' + company + '/financials/'
balance_page = 'https://finance.yahoo.com/quote/' + company + '/balance-sheet/'
set_income_page = url.urlopen(income_page).read()
set_balance_page = url.urlopen(balance_page).read()
soup_income = BeautifulSoup(set_income_page, 'html.parser')
soup_balance = BeautifulSoup(set_balance_page, 'html.parser')

# top-level rows parse fine
revenue_element = soup_income.find('span', string='Total Revenue').find_next('span').text
cogs_element = soup_income.find('span', string='Cost of Revenue').find_next('span').text
ebit_element = soup_income.find('span', string='Operating Income').find_next('span').text
net_element = soup_income.find('span', string='Pretax Income').find_next('span').text
# these two rows are collapsed under the breakdown list and raise the AttributeError
short_assets_element = soup_balance.find('span', string='Current Assets').find_next('span').text
inventory_element = soup_balance.find('span', string='Inventory').find_next('span').text
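
The `AttributeError` arises because `find` returns `None` when the label span is not in the fetched document, and `None` has no `find_next` method. A minimal defensive sketch (the helper name `find_value` is made up for illustration) surfaces a clearer message:

def find_value(soup, label):
    # return the text of the span that follows the label span,
    # or None if the label is absent from the fetched HTML
    label_span = soup.find('span', string=label)
    if label_span is None:
        print('label not found in page source:', label)
        return None
    return label_span.find_next('span').text

inventory_element = find_value(soup_balance, 'Inventory')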
  • The error that you're getting here means that BeautifulSoup wasn't able to find the element that you are calling `find_next` from (so the `find` returns `None`). Almost surely this tag doesn't exist in the page as it is fetched, but rather is generated when you click on the section heading. – tegancp Oct 12 '20 at 22:12
  • That's true. But how can I access a string that is generated when I click on the heading? – Karolis Zimantas Oct 13 '20 at 10:39
  • I would start with determining **how** the content is being generated; this will affect the choice of method to extract it. The answer to [this question](https://stackoverflow.com/questions/17597424/how-to-retrieve-the-values-of-dynamic-html-content-using-python), for example, has some links to resources that may be helpful – tegancp Oct 13 '20 at 15:50
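
One quick way to confirm the diagnosis in the comments above, namely that the breakdown rows are generated client-side, is to test for the label spans directly in the fetched document (a sketch reusing the variables from the question):

# a top-level row is present in the raw HTML; a collapsed breakdown row is
# rendered by JavaScript after load, so BeautifulSoup never sees its span
print(soup_balance.find('span', string='Total Assets') is not None)  # True
print(soup_balance.find('span', string='Inventory') is not None)     # False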

1 Answer


Here is an example of parsing this web page using Selenium. It emulates user behavior: waiting until the page has loaded, closing the pop-up, expanding a tree node by clicking on it, and extracting information from it.

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

company = input('enter company abbreviation: ')

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
# Selenium 3 call style; Selenium 4 expects service=Service('<<PATH_TO_CHROMEDRIVER>>')
wd = webdriver.Chrome('<<PATH_TO_CHROMEDRIVER>>', options=chrome_options)

# delay (how long selenium waits for element to be loaded)
DELAY = 30

# maximize browser window
wd.maximize_window()

# load page via selenium
wd.get('https://finance.yahoo.com/quote/' + company + '/financials/')

# check for the cookie-consent pop-up and close it if present
try:
    btn = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//button[text()="I agree"]')))
    wd.execute_script("arguments[0].scrollIntoView();", btn)
    wd.execute_script("arguments[0].click();", btn)
except TimeoutException:
    # no pop-up appeared within DELAY seconds; carry on
    pass

# wait for page to load
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.ID, 'Col1-1-Financials-Proxy')))

# parse content
soup_income = BeautifulSoup(results.get_attribute('innerHTML'), 'html.parser')

# extract values
revenue_element = soup_income.find('span', string='Total Revenue').find_next('span').text
cogs_element = soup_income.find('span', string='Cost of Revenue').find_next('span').text
ebit_element = soup_income.find('span', string='Operating Income').find_next('span').text
net_element = soup_income.find('span', string='Pretax Income').find_next('span').text

# load page via selenium
wd.get('https://finance.yahoo.com/quote/' + company + '/balance-sheet/')

# wait for page to load
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.ID, 'Col1-1-Financials-Proxy')))

# expand Total Assets to reveal the Current Assets row
btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable((By.XPATH, '//span[text()="Total Assets"]/preceding-sibling::button')))
wd.execute_script("arguments[0].scrollIntoView();", btn)
wd.execute_script("arguments[0].click();", btn)

# expand Current Assets to reveal the Inventory row
btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable((By.XPATH, '//span[text()="Current Assets"]/preceding-sibling::button')))
wd.execute_script("arguments[0].scrollIntoView();", btn)
wd.execute_script("arguments[0].click();", btn)

# re-read the now-expanded element from the live DOM and parse it
soup_balance = BeautifulSoup(results.get_attribute('innerHTML'), 'html.parser')

# extract values
short_assets_element = soup_balance.find('span', string='Current Assets').find_next('span').text
inventory_element = soup_balance.find('span', string='Inventory').find_next('span').text

# close webdriver
wd.quit()

print(revenue_element)
print(cogs_element)
print(ebit_element)
print(net_element)
print(short_assets_element)
print(inventory_element)
  • Thanks for the reply and the code. However, I'm struggling with this line, i.e. `btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable(...)`. Selenium raises a `selenium.common.exceptions.TimeoutException: Message:` error. I've tried changing the delay time and other approaches. Could you elaborate on this error (its meaning, etc.) and handling techniques? P.S. I've searched open sources but could not find any 'digestible' info – Karolis Zimantas Nov 04 '20 at 17:22
  • The error message means that after the configured delay the button is not clickable (most probably it was not found using the provided XPath). In this situation it helps to rerun the last session without the `--headless` flag and check manually whether the button is present and why this case differs from the successful ones. – Alexandra Dudkina Nov 04 '20 at 18:46
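
Along the lines of that suggestion, a small debugging sketch: temporarily drop the `--headless` flag, or capture what the browser actually rendered at the moment the wait times out (the file names here are arbitrary):

from selenium.common.exceptions import TimeoutException

try:
    btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable(
        (By.XPATH, '//span[text()="Total Assets"]/preceding-sibling::button')))
except TimeoutException:
    # inspect these artifacts to check whether the button exists
    # and why the XPath fails to match it
    wd.save_screenshot('debug.png')
    with open('debug.html', 'w', encoding='utf-8') as f:
        f.write(wd.page_source)
    raise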