I'm completely new to Python, but I wrote some code to parse the content of different sites using BeautifulSoup. The code should grab the <article> tag from a site or, if that is not available, fall back to the <p> tags. It works fine most of the time, but on some sites it gives back an error even though the page does contain <p> tags with content, so it should return the text between those <p> tags.
import sys

import requests
from bs4 import BeautifulSoup

# Download the page; exit if the request itself fails.
try:
    source = requests.get('https://reactpodcast.com/episodes/96').text
except requests.RequestException:
    print('Site does not exist')
    sys.exit()

soup = BeautifulSoup(source, 'lxml')

div_s = soup.find_all('div')
title = soup.find('title')
article = soup.find('article')
content = soup.find_all('p')

# Concatenate the text of every <p> tag as a fallback for pages without <article>.
allContent = ""
for c in content:
    allContent += c.text

# YouTube-specific elements.
yt_title = soup.find('span', class_='watch-title')
yt_description = soup.find('p', attrs={'id': 'eow-description'})

try:
    if article is not None:
        print(title.text)
        print(article.text)
    elif "https://www.youtube.com" in source:
        print(yt_title.text)
        print(yt_description.text)
    elif article is None:
        print(title.text)
        print(allContent)
    else:
        print('There is an error')
except Exception:
    print('This URL is invalid')
    sys.exit()
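One thing that might explain the error (an assumption, not verified against the failing sites): `soup.find()` returns `None` when nothing matches, so `title.text` raises an `AttributeError` on a page without a `<title>`, and the catch-all `except` then prints 'This URL is invalid' even though the `<p>` tags were found. Below is a minimal sketch of guarding against that. `text_or_none` and `pick_content` are hypothetical helper names, and `SimpleNamespace` objects stand in for real tags so the snippet runs without network access.

```python
from types import SimpleNamespace


def text_or_none(tag):
    """Return the tag's text, or None when the lookup found nothing."""
    return None if tag is None else tag.text


def pick_content(title, article, paragraphs):
    """Prefer the <article> text; otherwise join the text of all <p> tags.

    Returns a (title_text, body_text) tuple. title_text is None when the
    page has no <title>, instead of raising AttributeError.
    """
    title_text = text_or_none(title)
    if article is not None:
        return title_text, article.text
    return title_text, "".join(p.text for p in paragraphs)


# Stand-ins for parsed tags (hypothetical; real code would pass the results
# of soup.find(...) and soup.find_all(...)).
p1 = SimpleNamespace(text="Hello ")
p2 = SimpleNamespace(text="world")

# A page with no <title> and no <article>, only <p> tags.
result = pick_content(None, None, [p1, p2])
print(result)
```

With the script above, the call would be `pick_content(soup.find('title'), soup.find('article'), soup.find_all('p'))`. Printing the caught exception (`except Exception as e: print(e)`) would also show which lookup actually failed.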
Does anyone have any recommendations (tips & tricks) to solve this problem?
Thank you in advance!