
I'm having a hard time figuring out the correct XPath in my web scraping code.

I am trying to scrape different pieces of information from http://financials.morningstar.com/company-profile/c.action?t=AAPL. I have tried several paths; some seem to work and some do not. I am interested in the CIK value under Operation Details.

import requests
from lxml import html

page = requests.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
tree = html.fromstring(page.text)


#desc = tree.xpath('//div[@class="r_title"]/span[@class="gry"]/text()')  #works

#desc = tree.xpath('//div[@class="wrapper"]//div[@class="headerwrap"]//div[@class="h_Logo"]//div[@class="h_Logo_row1"]//div[@class="greeter"]/text()')    #works

#desc = tree.xpath('//div[@id="OAS_TopLeft"]//script[@type="text/javascript"]/text()')   #works

desc = tree.xpath('//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tbody//tr//th[@class="row_lbl"]/text()')

I can't figure out the last path. It seems like I am following the path correctly, but I get an empty list.

AK9309
  • The last element, th, is a table header in HTML, so you probably need to change it to td, which is for table data. – postelrich Oct 14 '15 at 18:33
  • http://stackoverflow.com/questions/24163745/beginner-to-scraping-keep-on-getting-empty-lists This might be a similar problem to yours; take a look. – James Russo Oct 14 '15 at 18:49
  • http://stackoverflow.com/questions/33110734/xpath-not-working-for-screen-scraping/33111061?noredirect=1#comment54037557_33111061 here an error in the html like that causes an empty parse – rebeling Oct 14 '15 at 19:00

1 Answer


The problem is that the Operation Details section is loaded separately with an additional GET request. Simulate that request in your code while maintaining a web-scraping session:

import requests
from lxml import html


with requests.Session() as session:
    page = session.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
    tree = html.fromstring(page.text)

    # get the operational details
    response = session.get("http://financials.morningstar.com/company-profile/component.action", params={
        "component": "OperationDetails",
        "t": "XNAS:AAPL",
        "region": "usa",
        "culture": "en-US",
        "cur": "",
        "_": "1444848178406"
    })

    tree_details = html.fromstring(response.content)
    print(tree_details.xpath('.//th[@class="row_lbl"]//text()'))
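
Since the goal is the CIK value rather than the labels, here is a minimal sketch of pairing each label with the cell next to it. The `SAMPLE` markup and the `label_values` helper are assumptions for illustration; the real component response may be laid out differently, so check it in the browser's developer tools first.

```python
from lxml import html

# Hypothetical fragment imitating the component response; the real
# markup may differ.
SAMPLE = """
<table class="r_table1 r_txt2">
  <tr><th class="row_lbl">Fiscal Year Ends</th><td>September</td></tr>
  <tr><th class="row_lbl">CIK</th><td>320193</td></tr>
</table>
"""


def label_values(fragment):
    """Map each th.row_lbl label to the text of the first td after it."""
    tree = html.fromstring(fragment)
    result = {}
    for th in tree.xpath('.//th[@class="row_lbl"]'):
        cells = th.xpath('following-sibling::td[1]')
        if cells:
            label = th.text_content().strip()
            result[label] = cells[0].text_content().strip()
    return result


details = label_values(SAMPLE)
print(details.get("CIK"))
```

With a live response, `label_values(response.content)` would be the equivalent call, assuming the label and value cells really do share a row.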

Old answer:

It's just that you should remove tbody from the expression:

//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tr//th[@class="row_lbl"]/text()

tbody is an element that the browser inserts into the DOM when the HTML source omits it, so it shows up in the developer tools but not in the raw markup that lxml parses.
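
A small offline sketch of the tbody point, using a made-up fragment (`RAW` is an assumption, not the real page source): lxml keeps the table as written, so a path that goes through tbody matches nothing.

```python
from lxml import html

# Made-up fragment with no <tbody> in the source, as is common in raw HTML.
RAW = (
    '<div id="OperationDetails">'
    '<table class="r_table1 r_txt2">'
    '<tr><th class="row_lbl">CIK</th></tr>'
    '</table></div>'
)

tree = html.fromstring(RAW)

# Path copied from browser tools, including tbody: no match.
with_tbody = tree.xpath('.//table//tbody//tr//th[@class="row_lbl"]/text()')

# Same path without tbody: finds the label.
without_tbody = tree.xpath('.//table//tr//th[@class="row_lbl"]/text()')

print(with_tbody)
print(without_tbody)
```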

alecxe
  • I still get an empty list. I believe my problem is that there are several `tr` in the table. So I should give it a number for `tr` like `//table[@class="r_table1 r_txt2"]//tr[3]//th[@class="row_lbl"]/text()`. But I still get an empty list – AK9309 Oct 14 '15 at 18:39
  • @AK9309 the problem is that the operational details are loaded dynamically with an additional get request to `http://financials.morningstar.com/company-profile/component.action`. – alecxe Oct 14 '15 at 18:59
  • I understand now. Thank you for taking your time and explaining it. – AK9309 Oct 14 '15 at 19:09
  • It works for AAPL(Apple) and GOOGL(Google). When I try AA or AB I get XMLSyntaxError. But I know that the page exists – AK9309 Oct 14 '15 at 19:30
  • @AK9309 there might be something wrong in the GET parameters - open the browser developer tools and mimic exactly the same parameters as you've seen sent in the browser. Also, try printing out the `response` or `response.status_code` and see what's there. – alecxe Oct 14 '15 at 19:32
  • I believe the problem is that different stocks have different `"t"`. For example AAPL has XNAS, while AA has XNYS. So I decided to not duplicate `"t"` every time. So I have `"t": stock` instead. Seems to be working now – AK9309 Oct 14 '15 at 20:06
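
The fix described in the last comment can be sketched as a small helper that takes the ticker as an argument instead of hard-coding `XNAS:AAPL`. The name `operation_details_params` is hypothetical, and treating `_` as a millisecond timestamp used for cache busting is an assumption based on the value in the answer. Python 3 is assumed.

```python
import time
from urllib.parse import urlencode

COMPONENT_URL = "http://financials.morningstar.com/company-profile/component.action"


def operation_details_params(ticker):
    """Build the query parameters for one ticker (hypothetical helper)."""
    return {
        "component": "OperationDetails",
        "t": ticker,  # bare ticker, no exchange prefix such as XNAS: or XNYS:
        "region": "usa",
        "culture": "en-US",
        "cur": "",
        # Assumption: "_" is a cache-busting timestamp in milliseconds.
        "_": str(int(time.time() * 1000)),
    }


params = operation_details_params("AA")
print(COMPONENT_URL + "?" + urlencode(params))
```

The resulting dictionary can be passed as `params=` to `session.get(COMPONENT_URL, ...)`; printing `response.status_code` before parsing, as suggested in the comments, helps separate a bad request from a parsing problem.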