1

I would like to scrape the content of the following website:

http://financials.morningstar.com/ratios/r.html?t=AMD

There, under Key Ratios, I would like to click the "Growth" button and then scrape the data in Python.

How can I do that?

alecxe
TJ1
  • Tried to use HttpFox toolbar in Firefox to find the URL that is called, without much success. Thanks. – TJ1 Mar 11 '15 at 04:09
  • A little comment on the side: BBG provides this as well in a much easier to scrape format, but surprisingly, Morningstar provides it as far back as 10 years. Interesting. – WGS Mar 11 '15 at 05:18

1 Answer

1

You can solve it with requests + BeautifulSoup. The page sends an asynchronous GET request to the http://financials.morningstar.com/financials/getKeyStatPart.html endpoint, which you need to simulate. The response is JSON whose componentData field holds an HTML fragment, and the Growth table is located inside the div with id="tab-growth":

from bs4 import BeautifulSoup
import requests


url = 'http://financials.morningstar.com/ratios/r.html?t=AMD'
keystat_url = 'http://financials.morningstar.com/financials/getKeyStatPart.html'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}

    # visit the target url
    session.get(url)

    params = {
        'callback': '',
        't': 'XNAS:AMD',
        'region': 'usa',
        'culture': 'en-US',
        'cur': '',
        'order': 'asc',
        '_': '1426047023943'
    }
    response = session.get(keystat_url, params=params)

    # get the HTML part from the JSON response
    soup = BeautifulSoup(response.json()['componentData'], 'html.parser')

    # grab the data
    for row in soup.select('div#tab-growth table tr'):
        print(row.text)
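
To fetch a different company, the `t` value is the part of `params` that has to change. A minimal sketch of a parameter builder is below. `build_params` is a hypothetical helper, not something from Morningstar. The `XNAS` prefix is copied from the AMD example, so tickers listed on other exchanges would presumably need a different one. Treating `_` as a millisecond timestamp is an assumption based only on its magnitude.

```python
import time


def build_params(ticker, exchange='XNAS', now=None):
    """Build query parameters for the getKeyStatPart endpoint (hypothetical helper)."""
    # Assumption: '_' is a millisecond timestamp acting as a cache buster.
    if now is None:
        now = time.time()
    return {
        'callback': '',  # empty, as in the answer's params
        't': '{}:{}'.format(exchange, ticker),
        'region': 'usa',
        'culture': 'en-US',
        'cur': '',
        'order': 'asc',
        '_': str(int(now * 1000)),
    }
```

The result could be passed as `params=build_params('AAPL')` in the `session.get(keystat_url, ...)` call above.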
alecxe
  • 1
    That is a great answer, alecxe. How did you find the `keystat_url = 'http://financials.morningstar.com/financials/getKeyStatPart.html'`, and how did you find the `params`? If instead of AMD I want data for another company, let's say AAPL, will the `params` stay the same? Finally, how did you come up with the content of `session.headers`? Thank you very much for the help. – TJ1 Mar 11 '15 at 04:31
  • 1
    @TJ1 the first step was to identify whether `requests` would get me the desired `div#tab-growth` just by getting the `http://financials.morningstar.com/ratios/r.html?t=AMD` page. The data was not there, which meant it is loaded asynchronously - I've used browser developer tools to inspect what requests are being sent during the page load and found an XHR request to `getKeyStatPart` endpoint - inspected parameters and repeated them in the code. `User-Agent` header is there just to pretend being a browser (not sure if it is required in this case). – alecxe Mar 11 '15 at 04:34
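
The first check described above (is the element already in the static HTML?) can be sketched with the standard library alone. `IdFinder` and `has_element_with_id` are hypothetical names for illustration. In practice the `html` argument would be the text of the page fetched with `requests`.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if dict(attrs).get('id') == self.target_id:
            self.found = True


def has_element_with_id(html, target_id):
    """Return True if the HTML string contains an element with that id."""
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.found
```

If this returns False for the page source but the element is visible in the browser, the content is most likely loaded by a separate request.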
  • 1
    @TJ1 if you are going to change the company, you would need to change the `t` parameter value also. – alecxe Mar 11 '15 at 04:34
  • What is the value `1426047023943` in the `params` dict? Can I omit it? – TJ1 Mar 11 '15 at 04:41
  • @TJ1 well, can I please leave this part for you to research and try? :) – alecxe Mar 11 '15 at 04:42
  • :-) That is fair, sure. I did a little bit of research and could find how you came up with the link. Let me ask one more question, as you are such an expert, and then I will do the rest of the research if required. Using Chrome's Inspect Element I found this `GET` request: `http://financials.morningstar.com/financials/getKeyStatPart.html?&callback=jsonp1426046772712&t=XNAS:AMD&region=usa&culture=en-US&cur=&order=asc&_=1426046772885`. I see the parameters there. But for `callback` the value is `jsonp1426046772712`; why did you leave it empty? – TJ1 Mar 11 '15 at 04:47
  • 1
    @TJ1 sure, thanks. Since this is `JSONP`, the callback value affects the actual response: with the `jsonp1426046772712` value, the JSON response would be wrapped like this: `jsonp1426046772712([...json..])`. With an empty `callback` value we get plain JSON that can be read using `response.json()` directly. Hope that makes sense. – alecxe Mar 11 '15 at 04:49
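
If a callback name is sent anyway, the wrapper described above has to be stripped before parsing. A minimal sketch, assuming the body has the shape `name(...)` with an optional trailing semicolon; `parse_jsonp` is a hypothetical helper and the regular expression is only a rough guess at what a callback name may look like:

```python
import json
import re

# Matches: callbackName( <payload> ) with optional whitespace and ';'
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def parse_jsonp(text):
    """Parse a response body that is either plain JSON or JSONP-wrapped JSON."""
    match = _JSONP_RE.match(text)
    # Fall back to the raw text when no wrapper is present
    payload = match.group(1) if match else text
    return json.loads(payload)
```

For example, `parse_jsonp(response.text)` could stand in for `response.json()` when `callback` is non-empty.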
  • @alecxe I really appreciate all the help, you are a master :) – TJ1 Mar 11 '15 at 04:58