How to act when not receiving the data when scraping with python?

Question

This site has data on stocks, and I'm trying to extract some of it: https://quickfs.net/company/AAPL:US

Here AAPL is the stock ticker and can be changed.

The page looks like one big table: the columns are years and the rows are calculated values such as Return on Assets and Gross Margin.

For this I tried to follow a few tutorials:

Introduction to Web Scraping (Python) - Lesson 02 (Scrape Tables)

Intro to Web Scraping with Python and Beautiful Soup

Web Scraping HTML Tables with Python

Web scraping with Python — A to Z Part A — Handling BeautifulSoup and avoiding blocks

I get stuck right at the beginning after importing the packages:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

I use this function to retrieve the data from the web page:

def make_soup(url):
    # Download the page and parse the returned HTML
    thepage = uReq(url)
    soupdata = soup(thepage, "html.parser")
    return soupdata

then

# A separate name keeps the imported `soup` alias usable for later calls
page_soup = make_soup("https://quickfs.net/company/AAPL:US")

Now, when I look at what data is inside the soup

page_soup.text

The output is just this and not all the data from the webpage:

'\n\n\n\n\n\n\n\n\n\n\n\nExport Fundamental Data U.S. and International Stocks - QuickFS.net\n\n\n\n\n\n  \r\n  Loading QuickFS...\r\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n'

I think it's a problem with this specific web page, but I have no idea how to handle it.

Passing a different URL to the function make_soup(url) sometimes does work.

Any help would be appreciated.

score 0 · Answer 1 · answered Apr 30 '20 at 19:40

That is because the page is fully dynamic: JavaScript loads all the data after the initial HTML arrives, and BeautifulSoup4 doesn't run JavaScript, so it only sees the "Loading QuickFS..." placeholder.

You have two choices here:

A) Switch to something like Selenium
B) Check what XHR requests the site is sending to the API server and try to emulate them from Python.
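
For option A, a minimal sketch might look like the following. It assumes Selenium 4 and a local Chrome installation, and it assumes the rendered page contains a table element to wait for, which has not been verified against the site. The names company_url and render_page are hypothetical helpers, not part of any library.

```python
def company_url(symbol):
    """Build the page URL from the question for a given ticker, e.g. "AAPL:US"."""
    return f"https://quickfs.net/company/{symbol}"


def render_page(url, wait_for_tag="table", timeout=15):
    """Load a page in headless Chrome and return the HTML after JavaScript ran.

    wait_for_tag is an assumption: the tag whose presence signals that the
    data has been rendered. Adjust it after inspecting the real page.
    """
    # Imported lazily so this module still loads where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the element exists or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, wait_for_tag))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The string returned by render_page(company_url("AAPL:US")) could then be passed to BeautifulSoup in place of the urlopen result used in the question.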

In the case of B, you would see that the site is making this call:

curl 'https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/' \
-XGET \
-H 'Accept: application/json, text/plain, */*' \
-H 'Content-Type: application/json' \
-H 'Origin: https://quickfs.net' \
-H 'Accept-Language: en-us' \
-H 'Host: api.quickfs.net' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15' \
-H 'Referer: https://quickfs.net/company/AAPL:US' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Connection: keep-alive' \
-H 'X-Auth-Token: ' \
-H 'X-Referral-Code: '

Instead, you can make the same request from Python:

import requests

# Call the same endpoint the page calls, then parse the JSON body
response = requests.get("https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/")
data = response.json()

Here data will be the raw data that the site uses to present the information:

{
    "datasets": {
        "metadata": {
            "_id": {},
            "qfs_symbol": "NAS:AAPL",
            "currency": "USD",
            "fsCat": "normal",
            "name": "Apple Inc.",
            "gs3_version_at_metadata_update": 20191106,
            "exchange": "NASDAQ",
            "industry": "Technology Hardware & Equipment",
            "symbol": "AAPL",
            "country": "US",
            "price": 278.58,
        ...
    }
}
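
To reuse the request for other tickers, the call can be wrapped in small functions. This is only a sketch: it uses the standard library's urllib (already imported in the question) rather than requests, it assumes the endpoint still answers without an X-Auth-Token as it did in the capture above, and build_api_url, fetch_annual_data and company_summary are hypothetical names. The keys read in company_summary are the ones visible in the sample response.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

API_ROOT = "https://api.quickfs.net/stocks"


def build_api_url(symbol):
    """Return the endpoint from the answer for a ticker such as "AAPL:US"."""
    # Keep ":" unescaped so the URL matches the captured request.
    return f"{API_ROOT}/{quote(symbol, safe=':')}/ovr/Annual/"


def fetch_annual_data(symbol, timeout=10):
    """Download and decode the JSON payload; raises HTTPError on 4xx/5xx."""
    request = Request(build_api_url(symbol), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def company_summary(data):
    """Pick a few fields out of datasets -> metadata, None where missing."""
    metadata = data.get("datasets", {}).get("metadata", {})
    keys = ("name", "symbol", "exchange", "currency", "price")
    return {key: metadata.get(key) for key in keys}
```

Calling company_summary(fetch_annual_data("AAPL:US")) would then give a small dictionary with the company name, symbol, exchange, currency and price, provided the server accepts the request.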

Thank you! May I ask how you know that "that page is fully dynamic"? And how did you get the first block of code you wrote (after "you would see that the site is making this call")? Thanks in advance — TaL, May 01 '20 at 08:31
You're welcome! I always look at the "real" source code of the site (view-source:https://quickfs.net/company/AAPL:US). Usually, if you see a lot more information on the site than there is in the source, it is probably because it was loaded with JavaScript. The next step is to figure out how the JavaScript gets that information. If you open the Network tab in the Dev Tools and refresh the site, you will see that an XHR call is made to the API (that is AJAX). If you right-click the call and click "Copy as cURL", you'll get the first block of code that I posted. — Fede Calendino, May 01 '20 at 09:01
Thank you again!! I've been learning a lot from this! If I may, there is another question. In the top right corner of this page there is a drop-down whose first value is "Overview", then "Income Statement", and the last is "Key Ratios". I'm trying to extract data from the Key Ratios page, but when I follow the instructions from your last comment to get the URL for the Key Ratios page, refreshing the web page and looking at the XHR call, the page goes back to the Overview page and I can't find the URL for the Key Ratios page. Is there a way to handle this? — TaL, May 03 '20 at 15:18
It seems that those options only trigger different ways of presenting the same data. In that case you might want to check the site's JavaScript files to see how the data is handled. — Fede Calendino, May 04 '20 at 16:04
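
The check described in the comments (the raw source holds far less text than the rendered page) can be roughly automated. The sketch below uses only the standard library. The 200-character threshold is an arbitrary assumption, and looks_js_rendered is a hypothetical helper name, so treat the result as a hint rather than proof.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text outside <script>/<style> and count <script> tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.script_count = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that a reader would actually see.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_js_rendered(html, min_visible_chars=200):
    """Guess whether raw HTML is just a shell that JavaScript fills in later.

    Returns True when the page has scripts but very little visible text.
    min_visible_chars is an assumed cut-off, not a measured value.
    """
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    visible_text = " ".join(parser.chunks)
    return parser.script_count > 0 and len(visible_text) < min_visible_chars
```

Passing the decoded response body of the company page to looks_js_rendered before parsing it would flag a page that contains little more than a title and a loading message.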
