I'm trying to get the company name, sector, and industry for stocks. I download the HTML for 'https://finance.yahoo.com/q/in?s={}+Industry'.format(sign), and then attempt to parse it with .xpath() from lxml.html.

To get the XPath for the data I'm trying to scrape, I go to the site in Chrome, right-click on the item, click Inspect Element, right-click on the highlighted area, and click Copy XPath. This has always worked for me in the past.

This problem can be reproduced with the following code (I'm using Apple as an example):

import requests
from lxml import html

page_p = 'https://finance.yahoo.com/q/in?s=AAPL+Industry'
name_p = '//*[@id="yfi_rt_quote_summary"]/div[1]/div/h2/text()'
sect_p = '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[1]/td/a/text()'
indu_p = '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[2]/td/a/text()'

page = requests.get(page_p)
tree = html.fromstring(page.text)

name = tree.xpath(name_p)
sect = tree.xpath(sect_p)
indu = tree.xpath(indu_p)

print('Name: {}\nSector: {}\nIndustry: {}'.format(name, sect, indu))

Which gives this output:

Name: ['Apple Inc. (AAPL)']
Sector: []
Industry: []

It's not encountering any download difficulties, since it is able to retrieve the name, but the other two don't work. If I replace their paths with //tr[1]/td/a/text() and //tr[2]/td/a/text(), respectively, it returns this:

Name: ['Apple Inc. (AAPL)']
Sector: ['Consumer Goods', 'Industry Summary', 'Company List', 'Appliances', 'Recreational Goods, Other']
Industry: ['Electronic Equipment', 'Apple Inc.', 'AAPL', 'News', 'Industry Calendar', 'Home Furnishings & Fixtures', 'Sporting Goods']

Obviously I could just slice out the 1st item in each list to get the data I need.

What I don't understand is that when I add tbody/ to the start (//tbody/tr[#]/td/a/text()) it fails again, even though the console in Chrome clearly shows both trs as being children of a tbody element.

[Image: Chrome console showing HTML hierarchy]

Why does this happen?

spelchekr

1 Answer

Browsers parse HTML and build an element tree from it; in that process they insert elements that are missing from the input HTML document.

In this case, the <tbody> elements are not in the source HTML. Your browser inserts them because they are implied by the table structure when missing. lxml does not insert them, however.
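
A minimal sketch of the difference, using made-up markup rather than the Yahoo page (the `demo` id and the two links are assumptions for illustration). The source contains no <tbody>, so a path that includes tbody/ finds nothing, while the same path without it matches:

```python
from lxml import html

# Hypothetical stand-in for the real page: note there is no <tbody> in the source.
SOURCE = """
<html><body>
<table id="demo">
  <tr><td><a href="#">Consumer Goods</a></td></tr>
  <tr><td><a href="#">Electronic Equipment</a></td></tr>
</table>
</body></html>
"""

tree = html.fromstring(SOURCE)

# Path as a browser would report it, with the implied <tbody> included.
with_tbody = tree.xpath('//*[@id="demo"]/tbody/tr[1]/td/a/text()')

# Same path with the tbody/ step removed, matching what lxml actually parsed.
without_tbody = tree.xpath('//*[@id="demo"]/tr[1]/td/a/text()')

print(with_tbody)     # []
print(without_tbody)  # ['Consumer Goods']
```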

Your browser tools are not the best tools to build XPath queries for this reason.

Removing the tbody/ path elements produces the results you are looking for:

>>> sect_p = '//*[@id="yfncsumtab"]/tr[2]/td[1]/table[2]/tr/td/table/tr[1]/td/a/text()'
>>> indu_p = '//*[@id="yfncsumtab"]/tr[2]/td[1]/table[2]/tr/td/table/tr[2]/td/a/text()'
>>> tree.xpath(sect_p)
['Consumer Goods']
>>> tree.xpath(indu_p)
['Electronic Equipment']
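
If you still want to start from a path copied out of the browser, one possible shortcut is to strip the tbody steps mechanically. The helper below is only a sketch; `strip_tbody` is a hypothetical name, not part of lxml, and it assumes the raw HTML really has no <tbody> elements:

```python
import re

def strip_tbody(xpath):
    """Remove every '/tbody' step (optionally with an index) from an XPath string."""
    # Match '/tbody' or '/tbody[1]' only when it is a whole path step.
    return re.sub(r'/tbody(?:\[\d+\])?(?=/|$)', '', xpath)

# The sector path from the question, as copied from Chrome.
chrome_path = ('//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]'
               '/tbody/tr/td/table/tbody/tr[1]/td/a/text()')

cleaned = strip_tbody(chrome_path)
print(cleaned)
```

This is not safe in general: if a page does include explicit <tbody> tags in its source, removing those steps will make the path fail instead. Checking the raw page source rather than the inspector's element tree tells you which case applies.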
Martijn Pieters
  • I googled for "tools to build XPath queries" since your comment had me wondering. Found this: https://www.quora.com/Which-is-the-best-tool-used-to-find-XPath I think I'll check out ChroPath or the Firefox add-in - XPath Checker. – harperville Jun 13 '18 at 15:36