I have been trying to improve my knowledge with Python and I think the code is pretty forward. However I do dislike abit the coding style I have done where I use too much try except in a content there it might not needed to be at first place and also to try to avoid the silenced expetions.
My goal is basically to have a ready payload before scraping as you will see at the top of the code. Those should be always declared before scraping. What i'm trying to do basically is to try to scrape those different data. If we don't find the data, then it should skip or set the value to [], None or False (Depending on what we are trying to do).
I have read abit regarding getattr
and isinstance
functions but im not sure if there might be a better way than using lots of Try except as a cover if it doesn't find the element on the webpage.
import requests
from bs4 import BeautifulSoup
payload = {
"name": "Untitled",
"view": None,
"image": None,
"hyperlinks": []
}
site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")
try:
payload['name'] = "{} {}".format(
bs4.find('meta', {'property': 'og:site_name'})["content"],
bs4.find('meta', {'name': 'twitter:domain'})["content"]
)
except Exception: # noqa
pass
try:
payload['view'] = "{} in total".format(
bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'}).text.strip().replace("\r\n", "").replace(" ", ""))
except Exception:
pass
try:
payload['image'] = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})["content"]
except Exception:
pass
try:
payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]
except Exception: # noqa
pass
print(payload)
EDIT:
Example to get incorrect value is to set any find bs4 elements to something else etc:
site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"
response = requests.get(site_url)
bs4 = BeautifulSoup(response.text, "html.parser")
print(bs4.find('meta', {'property': 'og:site_name'})["content"]) # Should be found
print(bs4.find('meta', {'property': 'og:site_name_TEST'})["content"]) # Should give us an error due to not found