
The website in question is http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx?symbol=HBL. Right now I am only performing analysis on the last quarter; if I were to expand to the past 4-5 quarters, would there be a better way of automating this task rather than doing it manually by setting the time range again and again and then extracting the table values?

What I tried doing:

import bs4 as bs
import requests

resp = requests.get("http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx?symbol=HBL")
soup = bs.BeautifulSoup(resp.text, "lxml")  # lxml only needs to be installed, not imported
mydivs = soup.find_all("div", {"class": "breadcrumbs"})
print(mydivs)

What I got:

[<div class="breadcrumbs">
<ul>
<li class="breadcrumbs-home">
<a href="#" title="Back To Home">
<i class="fa fa-home"></i>
</a>
</li>
<li>Snapshot   /   <span id="ContentPlaceHolder1_lbl_companyname">HBL - Habib Bank Ltd.</span>   /   Historical Prices
                    </li>
</ul>
</div>, <div class="breadcrumbs" style="background-color:transparent;border-color:transparent;margin-top:20px;">
<ul>
<div class="bootstrap-iso">
<div class="tp-banner-container">
<div class="table-responsive">
<div id="n1">
<table class="table table-bordered table-striped" id="list"><tr><td>Company Wise</td></tr></table>
<div id="pager"></div>
</div>
</div>
</div>
</div>
</ul>
</div>]

Inspecting the source, the table is in the div with class "breadcrumbs" (I found that through the "inspect element" thingy), but I don't see the place where all the values are defined/stored in the page's source. I'm kinda new to web scraping; where should I be looking to extract those values here?

Also, there are a total of 7 pages and I'm currently only trying to scrape the table from the first page. How would I go about scraping all 7 pages of my results and then converting them to a pandas DataFrame?

Ajwad

1 Answer


The page loads its data via JavaScript from an external source. By inspecting where the page is making its requests, you can load the same data directly with the requests and json modules.

You can tweak the parameters in the payload dict to get the data for the date range you want:

import json
import requests

# endpoint the page's JavaScript POSTs to for the table data
url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'

# par = ticker symbol, date1/date2 = date range (mm/dd/yyyy),
# rows = rows per page, page = page number,
# sidx/sord = sort column and sort order
payload = {"par": "HBL", "date1": "07/13/2019", "date2": "08/12/2019",
           "rows": 20, "page": 1, "sidx": "trading_Date", "sord": "desc"}

json_data = requests.post(url, json=payload).json()
print(json.dumps(json_data, indent=4))

Prints:

{
    "d": [
        {
            "trading_Date": "/Date(1565290800000)/",
            "trading_open": 111.5,
            "trading_high": 113.24,
            "trading_low": 105.5,
            "trading_close": 106.17,
            "trading_vol": 1349000,
            "trading_change": -4.71
        },
        {
            "trading_Date": "/Date(1565204400000)/",
            "trading_open": 113.94,
            "trading_high": 115.0,
            "trading_low": 110.0,
            "trading_close": 110.88,
            "trading_vol": 1122200,
            "trading_change": -3.48
        },

    ... and so on.

EDIT:

I found the URL from which the page is loading its data by looking at the Network tab in Firefox developer tools:

[screenshot: the Network tab in Firefox developer tools showing the request to SS_CompanySnapShotHP.aspx/chart]

There you can see the URL, the method the page uses to make the request (POST in this case), and the parameters needed:

[screenshot: the request details, showing the URL, method, and POST parameters]

I copied this URL and the parameters and used them in requests.post() to obtain the JSON data.
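
As for the pagination and pandas part of the question, here is a minimal, untested sketch. It assumes the endpoint returns an empty "d" list once you request a page past the last one; if that assumption doesn't hold, you can instead widen the date range (see the comments below) and request a large "rows" value:

import pandas as pd
import requests

url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'

all_rows = []
page = 1
while page <= 50:  # hard stop as a safety net
    payload = {"par": "HBL", "date1": "07/13/2019", "date2": "08/12/2019",
               "rows": 20, "page": page, "sidx": "trading_Date", "sord": "desc"}
    rows = requests.post(url, json=payload).json()["d"]
    if not rows:  # assumed: an empty page means we are past the last page
        break
    all_rows.extend(rows)
    page += 1

# flatten the list of dicts into a DataFrame
# (use pd.io.json.json_normalize on older pandas versions)
df = pd.json_normalize(all_rows)

# "/Date(1565290800000)/" holds a UNIX timestamp in milliseconds:
# extract the digits and let pandas convert them
ms = df["trading_Date"].str.extract(r"(\d+)")[0].astype("int64")
df["trading_Date"] = pd.to_datetime(ms, unit="ms")
print(df.head())

The date handling follows the comment discussion below: the digits inside /Date(...)/  are milliseconds since the epoch, which pd.to_datetime converts directly with unit="ms".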

Andrej Kesely
  • Could you please edit your post and explain what your code is doing in more detail? I understand it on a superficial level, but I don't think I'd be able to write something like this myself currently. Why is it that when I go to that url variable from my browser's address bar it gives me a server-side error? And I'm not sure how or where you got all of those parameters in payload from (sord, sidx) – Ajwad Aug 12 '19 at 16:20
  • @AjwadJaved I edited my answer and added a little bit of explanation (with screenshots) – Andrej Kesely Aug 12 '19 at 16:25
  • Thank you so much!! Just one last question: what would be the best way to format the output your code gives me into a pandas DataFrame? Ideally what we want to do is first create the headers from the first {} and after that just keep getting the values (by searching for the second "") and keep throwing them into our DataFrame. We'll have to format that ugly date format along the way too; I'm not even sure which format it is in currently – Ajwad Aug 12 '19 at 16:37
  • @AjwadJaved Unfortunately I cannot help you with pandas (I don't have it installed). But this could help: https://stackoverflow.com/questions/21104592/json-to-pandas-dataframe You can parse the date by extracting the numbers and stripping the last 3 zeros (it's a UNIX timestamp) – Andrej Kesely Aug 12 '19 at 16:45
  • Okay so I'm currently looking into normalising our json dataset but [our code is only returning 4 results?](https://i.imgur.com/sWqfHIA.png) – Ajwad Aug 12 '19 at 17:21
  • @AjwadJaved Try to specify a larger date range, `"date1":"01/01/2019","date2":"06/01/2019"` for example. – Andrej Kesely Aug 12 '19 at 17:25
  • Turns out I was using the wrong format: "date1":"01/01/2019","date2":"01/06/2019" instead of "date1":"01/01/2019","date2":"06/01/2019". Thanks again for everything, man – Ajwad Aug 12 '19 at 18:29
  • Also just thought I'd let you know (and for anyone else visiting this thread later on), this line did the trick and converted the json to a pandas dataframe: pd.io.json.json_normalize(json_data['d'], errors='ignore') Now just have to convert the date – Ajwad Aug 12 '19 at 18:34
  • @andrejkesely, great answer! Could you suggest any readings on how to parse pages that use JavaScript to load their content? – help-ukraine-now Aug 19 '19 at 20:07
  • @politicalscientist Parsing JavaScript manually can be hard; I would recommend looking at the `selenium` module (a minimal sketch follows this thread). Of course, often the JavaScript data can be parsed manually, in which case regular expressions (the `re` module) are required. – Andrej Kesely Aug 19 '19 at 20:11
  • Thanks. I never thought that requests.post could be used to get the data. I'll also look at selenium, thanks! – help-ukraine-now Aug 19 '19 at 20:27
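
For anyone landing here later, a minimal selenium sketch along the lines of the suggestion above. It is untested against the live site and assumes the page's JavaScript fills rows into the table with id "list" that appears in the question's HTML snippet:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx?symbol=HBL")

# wait until the JavaScript has populated the #list table
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#list tr td")))

# read the rendered rows cell by cell
for row in driver.find_elements(By.CSS_SELECTOR, "#list tr"):
    print([cell.text for cell in row.find_elements(By.CSS_SELECTOR, "td")])

driver.quit()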