How to check if a page content is loaded in Python using urllib?

Question

I'm trying to get content from a URL and parse the response using BeautifulSoup.

When this URL is loaded, it retrieves my favourite watchlist items. The problem is that the site takes a couple of seconds to display the data in a table, so when I run urlopen(my_url) the response has no table and my parsing method fails.

I'm trying to keep it simple as I'm learning the language, so I would like to use the tools I've already set up in my code. Based on what I have, is there a way to wait, or to check when the content is ready, so that I can fetch the data (the table content)?

Here is my code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
from urllib.error import URLError, HTTPError

URL = 'url route goes here'  # In compliance with SO rules I've removed the website path

def get_dom_from_url():
    html = b''  # fall back to an empty document if the request fails
    try:
        u_client = ureq(URL)
        html = u_client.read()
        u_client.close()
        print('DOM loaded!')
    except HTTPError as e:
        print(f'There has been an HTTP ERROR: {e.code}')
    except URLError as e:
        # URLError has no .code attribute; the cause is in .reason
        print(f'There has been a problem reaching the URL. ERROR: {e.reason}')
    return html

dom = soup(get_dom_from_url(), 'html.parser')

# Crawl the dom object and get the table thead element
col_names = [col.text for col in dom.table.thead.find_all('th')]
col_names = col_names[1:-2]
col_names

This is the error message:


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-102-625de133b2e2> in <module>
----> 1 col_names = [col.text for col in dom.table.thead.find_all('th')]
      2 col_names = col_names[1:-2]
      3 col_names

AttributeError: 'NoneType' object has no attribute 'thead'

The code above works when I load the URL without the route, but I need the route because I have to store this data for an ETL pipeline I'm working on.

If there is no way to achieve this using only urllib I would like to hear your suggestions.

Can you pass `URL` as a parameter to that function, like `get_dom_from_url(URL)`, and then declare `dom` as `dom = soup(get_dom_from_url(URL), 'html.parser')`? — Umutambyi Gad, Feb 21 '21 at 10:28
Yes but I end up in the same situation, the site takes a couple of seconds to fetch the watchlist and display the data — Ricardo Sanchez, Feb 21 '21 at 10:32
Does this answer your question? [scrape html generated by javascript with python](https://stackoverflow.com/questions/2148493/scrape-html-generated-by-javascript-with-python) — user202729, Feb 21 '21 at 10:34
Besides, questions that depend on a random external website are usually considered as not having a minimal reproducible example. — user202729, Feb 21 '21 at 10:35
@user202729 the website is not random, the question targets scraping a particular website. Removing the website from the sample code actually makes it irreproducible — RJ Adriaansen, Feb 21 '21 at 11:25
@RJAdriaansen I mean... it's true... I know that... but if the mentioned site is down then the question is not reproducible as well. https://chat.stackoverflow.com/transcript/41570?m=51606808#51606808 (too bad I can't find a proper meta post about that) — user202729, Feb 21 '21 at 11:42
@user202729 I understand, but cases like this, which concern scraping data loaded dynamically through JavaScript, are not easily turned into self-contained reproducible examples. Let's not be too strict on principle. — RJ Adriaansen, Feb 21 '21 at 12:17

score 1 · Answer 1 · answered Feb 21 '21 at 11:21

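As mentioned in the comments, the table is generated by JavaScript. urllib.request only downloads the raw HTML and does not execute scripts, whereas Selenium drives a real browser and can handle JavaScript: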

from selenium import webdriver
from bs4 import BeautifulSoup as soup

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
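# Expects chromedriver on PATH; recent Selenium releases no longer accept the driver path as a positional argument
wd = webdriver.Chrome(options=options)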
wd.get("https://coinmarketcap.com/watchlist/60321ee5b01cab343e1e37d6")

dom = soup(wd.page_source, 'html.parser')

col_names = [col.text for col in dom.table.thead.find_all('th')]
col_names = col_names[1:-2]
col_names

Output:

['Name', 'Price', '24h', '7d', 'Market Cap', 'Volume', 'Circulating Supply']
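
The snippet above reads wd.page_source right after wd.get(...). If the table is injected a moment later, it may still be missing at that point. Below is a minimal sketch of a polling helper that waits for a condition with a timeout. The name wait_until and its parameters are hypothetical, not part of Selenium; Selenium also ships its own WebDriverWait and expected_conditions helpers for the same purpose, which are likely the better choice in real code.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(interval)


# Hypothetical usage with the driver from the answer above:
#   wait_until(lambda: '<thead' in wd.page_source, timeout=15)
#   dom = soup(wd.page_source, 'html.parser')
```

The predicate is checked at least once, so a page that is already complete returns immediately without sleeping.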

I'm not sure what your purpose is, but you could also load tables directly into pandas without using BeautifulSoup:

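from io import StringIO
import pandas as pd
# Wrap the HTML in StringIO; newer pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(wd.page_source))[0]
df = df.iloc[:, 1:-2]  # drop the first column and the last two columns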

Output df.head():

|    | Name             | Price      | 24h   | 7d      | Market Cap         | Volume                           | Circulating Supply   |
|---:|:-----------------|:-----------|:------|:--------|:-------------------|:---------------------------------|:---------------------|
|  0 | Bitcoin1BTC      | $57,515.75 | 3.67% | 17.00%  | $1,069,904,656,718 | $65,190,502,8631,135,430 BTC     | 18,634,643 BTC       |
|  1 | Ethereum2ETH     | $1,972.76  | 1.32% | 7.03%   | $225,668,814,800   | $30,790,664,03815,658,144 ETH    | 114,760,589 ETH      |
|  2 | Binance Coin3BNB | $285.07    | 4.20% | 113.31% | $43,505,990,437    | $9,923,804,82135,249,242 BNB     | 154,532,785 BNB      |
|  3 | Polkadot4DOT     | $39.69     | 4.23% | 39.72%  | $36,117,982,640    | $5,555,973,830139,996,164 DOT    | 910,079,701 DOT      |
|  4 | Cardano5ADA      | $1.14      | 8.25% | 31.00%  | $35,603,496,069    | $10,299,401,9179,000,239,285 ADA | 31,112,484,646 ADA   |

Thank you, it seems Selenium is the way to go, but I'm having a hard time trying to set up Selenium+Mac+Edge ... I'll keep trying and report back later — Ricardo Sanchez, Feb 21 '21 at 20:18
Using Edge on a Mac has given me too much hassle; I switched to ChromeDriver and it works like a charm — Ricardo Sanchez, Feb 22 '21 at 15:57

score 1 · Accepted Answer · answered Feb 22 '21 at 09:48

Actually, you don't need Selenium here. The data is embedded in the source HTML, inside a <script> tag, as valid JSON. You just need to parse that:

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

url = 'https://coinmarketcap.com/watchlist/60321ee5b01cab343e1e37d6/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = soup.find('script', {'id':'__NEXT_DATA__'}).string  # .string holds the raw script content

jsonData = json.loads(jsonStr)

data = jsonData['props']['initialProps']['pageProps']['fetchedWatchlist']['cryptoCurrencies']
rows = []
for each in data:
    quotes_row = each.pop('quotes')[0]
    each.pop('tags')
    if 'platform' in each.keys():
        each.pop('platform')
    each.update(quotes_row)
    
    rows.append(each)

df = pd.DataFrame(rows)

Output:

print(df.to_string())
     id name symbol          slug  status  rank  marketPairCount  circulatingSupply   totalSupply     maxSupply               lastUpdated                 dateAdded         price     volume24h     marketCap  percentChange1h  percentChange24h  percentChange7d
0     1  USD    BTC       bitcoin  active     1             9717       1.863544e+07  1.863544e+07  2.100000e+07  2021-02-22T09:37:02.000Z  2013-04-28T00:00:00.000Z  55579.249971  5.656584e+10  1.035744e+12        -1.232746         -1.234765        16.978174
1  1027  USD    ETH      ethereum  active     2             5982       1.147732e+08  1.147732e+08           NaN  2021-02-22T09:37:02.000Z  2015-08-07T00:00:00.000Z   1855.072456  2.450605e+10  2.129125e+11        -1.373583         -4.104364         5.315240
2  1839  USD    BNB  binance-coin  active     3              469       1.545328e+08  1.705328e+08  1.705328e+08  2021-02-22T09:37:11.000Z  2017-07-25T00:00:00.000Z    272.095668  6.811884e+09  4.204770e+10        -2.381284          2.937286       109.533310
3   825  USD   USDT        tether  active     4            10829       3.445054e+10  3.570817e+10           NaN  2021-02-22T09:37:08.000Z  2015-02-25T00:00:00.000Z      0.999576  1.087710e+11  3.443593e+10        -0.061248         -0.023795        -0.074917
4  6636  USD    DOT  polkadot-new  active     5              145       9.103144e+08  1.045967e+09           NaN  2021-02-22T09:36:05.000Z  2020-08-19T00:00:00.000Z     37.503515  3.257901e+09  3.413999e+10        -1.327435         -2.635214        40.263648
5  2010  USD    ADA       cardano  active     6              231       3.111248e+10  4.500000e+10  4.500000e+10  2021-02-22T09:37:09.000Z  2017-10-01T00:00:00.000Z      1.040491  6.621492e+09  3.237226e+10        -1.594681         -7.316003        25.951127
6    52  USD    XRP           xrp  active     7              673       4.540403e+10  9.999083e+10  1.000000e+11  2021-02-22T09:38:03.000Z  2013-08-04T00:00:00.000Z      0.581321  1.102498e+10  2.639430e+10        -1.640063         11.286157         2.731301
7     2  USD    LTC      litecoin  active     8              754       6.653055e+07  6.653055e+07  8.400000e+07  2021-02-22T09:38:02.000Z  2013-04-28T00:00:00.000Z    216.783950  6.530638e+09  1.442276e+10        -2.134667         -3.477237         5.932102
8  1975  USD   LINK     chainlink  active     9              471       4.085096e+08  1.000000e+09  1.000000e+09  2021-02-22T09:37:11.000Z  2017-09-20T00:00:00.000Z     32.145503  1.885830e+09  1.313174e+10        -1.378857         -5.152372        -0.036835
9  1831  USD    BCH  bitcoin-cash  active    10              581       1.866177e+07  1.866177e+07  2.100000e+07  2021-02-22T09:37:07.000Z  2017-07-23T00:00:00.000Z    679.047253  5.800439e+09  1.267222e+10        -1.298651         -0.162108        -1.595937
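
Since the question asked for a urllib-only route, here is a rough sketch of the same idea using only the standard library. It also addresses a detail visible in the output above: the name column shows USD because the quote's keys overwrite the coin's own keys, so this variant prefixes them with quote_. The names NextDataParser, extract_next_data, flatten_coin and fetch_html are hypothetical, and the User-Agent header is an assumption about what the site accepts.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('id') == '__NEXT_DATA__':
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed JSON payload, or None if the script tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    text = ''.join(parser.chunks)
    return json.loads(text) if text.strip() else None


def flatten_coin(coin):
    """Merge the first quote into a copy of the coin, prefixing its keys."""
    row = {k: v for k, v in coin.items()
           if k not in ('quotes', 'tags', 'platform')}
    quotes = coin.get('quotes') or [{}]
    row.update({f'quote_{k}': v for k, v in quotes[0].items()})
    return row


def fetch_html(url):
    # Assumption: the site may reject urllib's default User-Agent.
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(request) as response:
        return response.read().decode('utf-8')
```

With these helpers, something like rows = [flatten_coin(c) for c in data] could feed pd.DataFrame(rows), where data is looked up with the same key path as in the answer. That path reflects the page at the time of writing and may change.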

Thank you! I spent a day hacking my Edge+Mac+Selenium setup and had to switch to ChromeDriver in the end, but since this actually solves my problem I'm accepting yours as the correct answer — Ricardo Sanchez, Feb 22 '21 at 15:59
Ya selenium is good. I like to use it too, but as a last resort. I always go in this order; 1) check for api in XHR 2) check if it’s embedded in a script tag, 3) check if I need to create a session and post to get response, 4) is the same data available on another site, 5) and if none of those look promising, use selenium. — chitown88, Feb 22 '21 at 20:20
There’s also the requests-html package, which is supposed to be like requests but with JavaScript support. But I’ve never used it as it conflicts with my IDE, and never really *needed* to use it. — chitown88, Feb 22 '21 at 20:21
