I'm trying to get the company name, sector, and industry for stocks. I download the HTML for 'https://finance.yahoo.com/q/in?s={}+Industry'.format(sign), and then attempt to parse it with .xpath() from lxml.html.
To get the XPath for the data I'm trying to scrape, I go to the site in Chrome, right-click on the item, click Inspect Element, right-click on the highlighted area, and click Copy XPath. This has always worked for me in the past.
This problem can be reproduced with the following code (I'm using Apple as an example):
import requests
from lxml import html
page_p = 'https://finance.yahoo.com/q/in?s=AAPL+Industry'
name_p = '//*[@id="yfi_rt_quote_summary"]/div[1]/div/h2/text()'
sect_p = '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[1]/td/a/text()'
indu_p = '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[2]/td/a/text()'
page = requests.get(page_p)
tree = html.fromstring(page.text)
name = tree.xpath(name_p)
sect = tree.xpath(sect_p)
indu = tree.xpath(indu_p)
print('Name: {}\nSector: {}\nIndustry: {}'.format(name, sect, indu))
Which gives this output:
Name: ['Apple Inc. (AAPL)']
Sector: []
Industry: []
The download itself isn't the problem, since name is retrieved correctly, but the other two come back empty. If I replace their paths with //tr[1]/td/a/text() and //tr[2]/td/a/text(), respectively, it returns this:
Name: ['Apple Inc. (AAPL)']
Sector: ['Consumer Goods', 'Industry Summary', 'Company List', 'Appliances', 'Recreational Goods, Other']
Industry: ['Electronic Equipment', 'Apple Inc.', 'AAPL', 'News', 'Industry Calendar', 'Home Furnishings & Fixtures', 'Sporting Goods']
Obviously I could just take the first item in each list to get the data I need.
What I don't understand is that when I add tbody/ to the start (//tbody/tr[#]/td/a/text()) it fails again, even though the console in Chrome clearly shows both trs as being children of a tbody element.
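Here is a smaller, offline sketch that seems to show the same symptom. The markup and the "demo" id are made up for illustration. It assumes the HTML Yahoo actually serves contains no tbody tags (I haven't confirmed that), and that the tbody shown in Chrome's inspector is something the browser adds while building the DOM:

```python
from lxml import html

# Hypothetical markup standing in for the page source: the rows sit
# directly under <table>, with no <tbody> tag written out.
raw = """
<html><body>
<table id="demo">
  <tr><td><a href="#">Consumer Goods</a></td></tr>
  <tr><td><a href="#">Electronic Equipment</a></td></tr>
</table>
</body></html>
"""

tree = html.fromstring(raw)

# Path in the style Chrome's "Copy XPath" produces, including tbody.
with_tbody = tree.xpath('//table[@id="demo"]/tbody/tr[1]/td/a/text()')

# Same path with the tbody step removed.
without_tbody = tree.xpath('//table[@id="demo"]/tr[1]/td/a/text()')

# Number of tbody elements lxml created while parsing.
tbody_count = len(tree.xpath('//tbody'))

print('with tbody:    {}'.format(with_tbody))
print('without tbody: {}'.format(without_tbody))
print('tbody count:   {}'.format(tbody_count))
```

With this input, only the path without tbody matches anything, and lxml reports no tbody elements in the parsed tree at all.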
Why does this happen?