
I have been trying to improve my knowledge of Python, and I think the code is pretty straightforward. However, I somewhat dislike the coding style I have ended up with: I use too many try/except blocks in places where they might not be needed in the first place, and I would also like to avoid the silenced exceptions.

My goal is basically to have a ready payload before scraping, as you will see at the top of the code; those keys should always be declared before scraping starts. What I'm trying to do is scrape those different pieces of data, and if a piece isn't found, skip it or set the value to [], None or False (depending on what we are trying to do). I have read a bit about the getattr and isinstance functions, but I'm not sure whether there is a better way than wrapping everything in lots of try/except blocks in case an element isn't found on the webpage. (I've sketched the isinstance idea I had in mind below, right after my current code.)

import requests
from bs4 import BeautifulSoup

payload = {
    "name": "Untitled",
    "view": None,
    "image": None,
    "hyperlinks": []
}

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"

response = requests.get(site_url)

bs4 = BeautifulSoup(response.text, "html.parser")

try:
    payload['name'] = "{} {}".format(
        bs4.find('meta', {'property': 'og:site_name'})["content"],
        bs4.find('meta', {'name': 'twitter:domain'})["content"]
    )
except Exception:  # noqa
    pass

try:
    payload['view'] = "{} in total".format(
        bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'}).text.strip().replace("\r\n", "").replace(" ", ""))
except Exception:
    pass

try:
    payload['image'] = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})["content"]
except Exception:
    pass

try:
    payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]

except Exception:  # noqa
    pass

print(payload)
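
For reference, this is roughly the isinstance-based approach I was considering instead of all the try/except blocks (an untested sketch; the helper name meta_content is made up):

from bs4 import Tag

def meta_content(soup, attrs):
    # Return the tag's "content" attribute, or None if the tag
    # or the attribute is missing, instead of raising.
    tag = soup.find('meta', attrs)
    if isinstance(tag, Tag):
        return tag.get('content')
    return None

# e.g. payload['image'] = meta_content(bs4, {'itemprop': 'image primaryImageOfPage'})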

EDIT:

An example of how to get an incorrect value is to change any of the bs4 find lookups to something that doesn't exist, e.g.:

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"

response = requests.get(site_url)

bs4 = BeautifulSoup(response.text, "html.parser")

print(bs4.find('meta', {'property': 'og:site_name'})["content"]) # Should be found
print(bs4.find('meta', {'property': 'og:site_name_TEST'})["content"]) # Raises TypeError: find() returns None, so None["content"] fails
  • I did get a downvote without any comment. I would appreciate knowing the reason, as I could learn from it :) – PythonNewbie Feb 21 '21 at 18:17

1 Answer


From the documentation, find returns None when it doesn't find anything, while find_all returns an empty list []. You can check that the results are not None before trying to index them.
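
A quick, self-contained way to confirm that behaviour:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.find('meta'))      # None, because there is no matching tag
print(soup.find_all('meta'))  # [], an empty result list

Applying that check to the original code: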

import requests
from bs4 import BeautifulSoup

payload = {
    "name": "Untitled",
    "view": None,
    "image": None,
    "hyperlinks": []
}

site_url = "https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list"

response = requests.get(site_url)

bs4 = BeautifulSoup(response.text, "html.parser")

try:
    prop = bs4.find('meta', {'property': 'og:site_name'})
    name = bs4.find('meta', {'name': 'twitter:domain'})
    if prop is not None and name is not None:
        payload['name'] = "{} {}".format(prop["content"], name["content"])

    div = bs4.find('div', {'class': 'grid--cell ws-nowrap mb8'})
    if div is not None:
        payload['view'] = "{} in total".format(div.text.strip().replace("\r\n", "").replace(" ", ""))

    itemprop = bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})
    if itemprop is not None:
        payload['image'] = itemprop["content"]
        
    payload['hyperlinks'] = [hyperlinks['href'] for hyperlinks in bs4.find_all('a', {'class': 'question-hyperlink'})]
except Exception:  # noqa
    pass

print(payload)

So you can use a single try/except. If you want to handle different exceptions differently, you can add separate except blocks for them:

try:
    ...
except ValueError:
    value_error_handler()
except TypeError:
    type_error_handler()
except Exception:
    catch_all()
  • Hmmm. But then if, e.g., the payload name fails, it won't continue with the rest of the scraping? Am I correct? (One way around that is sketched after these comments.) – PythonNewbie Feb 21 '21 at 18:54
  • Sorry for the late response! I really appreciate it! It looks a lot better than I expected, and I also did some testing, which went pretty well! Thank you so much Jimi! :) – PythonNewbie Feb 24 '21 at 11:36
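
On the concern raised in the first comment: with the None checks in place nothing normally raises, but if an unexpected exception did occur while building an early field, the remaining fields inside the same try block would indeed be skipped. If that matters, one option is a small per-field wrapper so a single failure only costs that one value. A rough sketch (scrape_or_default is a made-up name; bs4 and payload are assumed to exist as in the answer's code):

def scrape_or_default(default, getter):
    # Run one field's lookup; fall back to the default if anything raises.
    try:
        return getter()
    except Exception:  # noqa
        return default

payload['image'] = scrape_or_default(
    None,
    lambda: bs4.find('meta', {'itemprop': 'image primaryImageOfPage'})["content"]
)
payload['hyperlinks'] = scrape_or_default(
    [],
    lambda: [a['href'] for a in bs4.find_all('a', {'class': 'question-hyperlink'})]
)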