Beautifulsoup not catching the content

Question

I'm completely new to Python, but I wrote some code to parse content from different sites using BeautifulSoup. The code should grab the <article> tags from a site or, if none are available, the <p> tags. It works fine most of the time, but on some sites it fails with an error, even though when I check the site there are <p> tags with content in them, so it should return the text between the <p> tags.

import requests
import sys
from bs4 import BeautifulSoup

try:
    source = requests.get('https://reactpodcast.com/episodes/96').text
except:
    print('Site does not exist')
    sys.exit()

soup = BeautifulSoup(source, 'lxml')
div_s = soup.find_all('div')
title = soup.find('title')
article = soup.find('article')

content = soup.find_all('p')
allContent = ""
for c in content:
    allContent += c.text
    
yt_title = soup.find('span', class_='watch-title')
yt_description = soup.find('p', attrs={'id': 'eow-description'})
try:
    if article != None:
        print(title.text)
        print(article.text)
    elif "https://www.youtube.com" in source:
        print(yt_title.text)
        print(yt_description.text)
    elif article == None:
        print(title.text)
        print(allContent)
    else:
        print('There is an error')
except:
    print('This URL is invalid')
    sys.exit()

Does anyone have any recommendations (tips & tricks) to solve this problem?

Thank you in advance!

Hi there, good day, and thanks for the example. Great: you're gathering data from two sites and collecting the data ... that is great. – zero, Jun 29 '20 at 13:48

score 0 · Answer 1 · answered Jun 29 '20 at 12:48


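I used to have this problem. It is probably caused by JavaScript: the page builds its content in the browser, so the HTML that requests downloads does not contain it. I recommend using Selenium to get around this; see "How to use Selenium with Python?".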
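As a rough sketch of what that could look like (not a definitive implementation): the code below assumes Selenium 4 with Chrome and a matching driver available on the machine. The helper names fetch_rendered_html and paragraph_text are made up for illustration, and waiting for the first <p> tag is only one possible way to decide that the page has finished rendering.

```python
from bs4 import BeautifulSoup


def paragraph_text(html):
    """Join the text of every non-empty <p> tag in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (p.get_text(strip=True) for p in soup.find_all("p"))
    return "\n".join(t for t in texts if t)


def fetch_rendered_html(url, timeout=10):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so paragraph_text() still works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one <p> exists; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "p"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

With a browser installed, paragraph_text(fetch_rendered_html('https://reactpodcast.com/episodes/96')) would then return whatever <p> text the rendered page contains.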


ounessy


Is it also good to get the <p> tags from, let's say, 100 different sites? – Balázs Kemenesi Jun 30 '20 at 07:21
Selenium is slower than the normal approach (requests). Some websites generate their content using JS, so the normal approach cannot get that content. Selenium allows you to open the website in a browser and get all the content generated with JS. – ounessy Jun 30 '20 at 08:09
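Since the browser route is the slow one, one possible approach for many sites is to try requests first and only fall back to a browser when the plain HTML has no <p> text. The sketch below assumes that "no paragraph text" is a good enough signal for a JavaScript-rendered page, which will not hold for every site. The names scrape, fetch_static_html and fetch_rendered are hypothetical; fetch_rendered stands for any function that returns the rendered HTML, for example one built on Selenium.

```python
import requests
from bs4 import BeautifulSoup


def paragraph_text(html):
    """Join the text of every non-empty <p> tag in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (p.get_text(strip=True) for p in soup.find_all("p"))
    return "\n".join(t for t in texts if t)


def fetch_static_html(url, timeout=10):
    """Download the raw HTML with requests (no JavaScript is executed)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def scrape(url, fetch_static=fetch_static_html, fetch_rendered=None):
    """Return paragraph text, using the browser only when the plain HTML has none."""
    text = paragraph_text(fetch_static(url))
    if text or fetch_rendered is None:
        return text
    # Slow path: the static HTML had no <p> text, so ask the browser-based fetcher.
    return paragraph_text(fetch_rendered(url))
```

This way the cost of starting a browser is only paid for the sites that actually need it.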

score 0 · Answer 2 · answered Jun 29 '20 at 12:56

I can suggest some improvements for your code:

Don't compare an object with None using something != None; you can read about why in this article: https://realpython.com/python-is-identity-vs-equality.
It's better to write something is not None or something is None.
Avoid a bare except that doesn't name the Error or Exception it catches. You can find some useful information here: https://www.techbeamers.com/use-try-except-python/
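Applied to the code in the question, the two points above might look roughly like this. It is a simplified sketch, not a drop-in replacement: the YouTube branch is left out, the function names extract_text and fetch are invented for the example, and catching requests.exceptions.RequestException is just one reasonable choice of exception to name.

```python
import requests
from bs4 import BeautifulSoup


def extract_text(html):
    """Return (title, body): body is the <article> text or, failing that, all <p> text."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    article = soup.find("article")

    # Identity checks ("is None" / "is not None") instead of == / != None.
    title_text = title.get_text(strip=True) if title is not None else ""
    if article is not None:
        body = article.get_text(" ", strip=True)
    else:
        body = "\n".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return title_text, body


def fetch(url):
    """Download url, returning None on network or HTTP errors."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Only request-related failures are caught; other bugs still surface.
        print(f"Could not fetch {url}: {exc}")
        return None
    return response.text
```

Because only request errors are caught, a mistake such as calling .text on a missing tag shows up as a real traceback instead of a misleading "This URL is invalid" message.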

Good luck!
