I'm trying to get content from a URL and parse the response using BeautifulSoup.
When this URL is loaded, it retrieves my favourite watchlist items. The problem is that the site takes a couple of seconds after loading to display the data in a table, so when I run urlopen(my_url) the response has no table and my parsing fails.
I'm trying to keep it simple since I'm still learning the language, so I would like to use the tools I've already set up in my code. Based on what I have, is there a way to wait, or to check when the content is ready, so that I can fetch the data (the table content)?
Here is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
from urllib.error import URLError, HTTPError

URL = 'url route goes here'  # In compliance with SO rules I've removed the website path


def get_dom_from_url():
    html = None  # stays None if the request fails
    try:
        u_client = ureq(URL)
        html = u_client.read()
        u_client.close()
    except HTTPError as e:
        print(f'There has been an HTTP ERROR: {e.code}')
    except URLError as e:
        # URLError has no .code attribute; .reason holds the cause
        print(f'There has been a problem reaching the URL. ERROR: {e.reason}')
    else:
        # Only report success when no exception was raised
        print('''
        DOM loaded!
        ''')
    return html


dom = soup(get_dom_from_url(), 'html.parser')

# Crawl the dom object and get the table thead element
col_names = [col.text for col in dom.table.thead.find_all('th')]
col_names = col_names[1:-2]
col_names
This is the error message:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-102-625de133b2e2> in <module>
----> 1 col_names = [col.text for col in dom.table.thead.find_all('th')]
      2 col_names = col_names[1:-2]
      3 col_names

AttributeError: 'NoneType' object has no attribute 'thead'
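As far as I can tell, the error means `dom.table` is `None`, i.e. the HTML returned by `urlopen` contains no `<table>` element at all. Here is a minimal sketch of how I think this could be confirmed using only the standard library, before any BeautifulSoup parsing. `TagCounter` and `count_tags` are hypothetical helper names, and the two sample pages are made-up markup, not the real site:

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given start tag appears in a document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag='table'):
    """Return the number of <tag> start tags in html (bytes or str)."""
    if isinstance(html, bytes):
        html = html.decode('utf-8', errors='replace')
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


# Made-up markup standing in for the two kinds of response (assumption)
static_page = b'<html><body><table><thead><tr><th>A</th></tr></thead></table></body></html>'
js_page = b'<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(count_tags(static_page))  # table is present in the raw HTML
print(count_tags(js_page))      # no table: dom.table would be None here
```

If the count is zero for the real response, the table is not in the HTML that `urlopen` receives, which would explain the `NoneType` error.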
The code above works when I load the URL without the route, but I need the route because I need to store this data for an ETL pipeline I'm working on.
If there is no way to achieve this using only urllib, I would like to hear your suggestions.
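To make the "wait" idea concrete, this is roughly what I have in mind, as a sketch only. `fetch` and `wait_for_table` are hypothetical names, and the attempt count and delay are arbitrary. It assumes the server eventually includes the table in the HTML it sends. If the table is built in the browser by JavaScript, `urlopen` does not run scripts, so I assume no amount of retrying would help:

```python
import time
from urllib.request import urlopen


def fetch(url):
    """Download url and return the raw response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def wait_for_table(url, attempts=5, delay=2.0, fetcher=fetch):
    """Re-request url until the body contains a <table> tag.

    Returns the body as bytes, or None if no attempt contained a table.
    The fetcher parameter exists so the function can be tried offline.
    """
    for attempt in range(attempts):
        html = fetcher(url)
        if b'<table' in html.lower():
            return html
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before asking the server again
    return None
```

The result could then be passed to `soup(html, 'html.parser')` only when it is not `None`, which would at least avoid the `AttributeError` above.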