Python: BeautifulSoup object isn't the same as actual source code

Question

Could someone explain why the soup object does not contain the h3 element, and how I can get at that data?

import requests
from bs4 import BeautifulSoup

response = requests.get(url="https://www.empireonline.com/movies/features/best-movies-2/")
webpage = response.text

soup = BeautifulSoup(webpage, "html.parser")
main_list = soup.select(selector="h3.jsx-4245974604")

print(soup.prettify())

This is my Python code, in case it helps.

score 1 · Accepted Answer · answered Jun 30 '21 at 02:28

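The page is rendered dynamically with JavaScript, and requests does not execute JavaScript, so the HTML it downloads does not contain the h3 elements you see in the browser. However, the data is available in JSON format on the website, so you can extract it using only the re/json modules. Using BeautifulSoup is not required.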

import re
import requests


response = requests.get("https://www.empireonline.com/movies/features/best-movies-2/")

for title in re.findall(r'"titleText":"(.*?)",', response.text)[1:]:  # [1:] skips the first match, since the first title appears twice
    print(title)

Output:

100) Stand By Me
99) Raging Bull
98) Amelie
97) Titanic
96) Good Will Hunting
95) Arrival
94) Lost In Translation
...
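
If the captured titles contain JSON escape sequences (for example `\u00e9`), the json module can decode them. The following is only a minimal sketch: `sample_html` is a made-up stand-in for `response.text`, the `rank` field is invented, and `extract_titles` is a hypothetical helper, so the real page's structure may well differ.

```python
import json
import re

# Hypothetical stand-in for response.text; the real page's markup may differ.
sample_html = (
    '<html><body><div id="root"></div>'
    '<script>{"items":[{"titleText":"100) Stand By Me","rank":100},'
    '{"titleText":"98) Am\\u00e9lie","rank":98}]}</script>'
    '</body></html>'
)


def extract_titles(html):
    """Return every "titleText" value, decoded as a JSON string."""
    raw_values = re.findall(r'"titleText":"(.*?)",', html)
    # Re-wrap each capture in quotes so json.loads resolves escapes like \u00e9.
    return [json.loads(f'"{value}"') for value in raw_values]


titles = extract_titles(sample_html)
print(titles)
```

Note that the sample markup has no h3 tag at all: the titles only exist inside the script data, which is why the soup object in the question never sees them. The lazy pattern would cut a title short if it contained an escaped quote followed by a comma, so treat this as an approximation rather than a full JSON parse.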

See also:

Web-scraping JavaScript page with Python

MendelG

Could you explain to me what this string is? `r'"titleText":"(.*?)",'` – Sai Nallani Jun 30 '21 at 02:37
@SaiNallani This is called a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) pattern. In this case, we're capturing the text that follows "titleText": up to the next closing quote and comma, which is where the titles we're looking for are stored. See [this demo on regex101](https://regex101.com/r/qjkCgJ/1) for an explanation – MendelG Jun 30 '21 at 02:42
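
To see what the `?` in `(.*?)` does, here is a small sketch comparing the lazy pattern with a greedy one. The `text` string is a made-up fragment used only for illustration (the `id` field is an assumption), not the site's actual data.

```python
import re

# Hypothetical fragment of the embedded JSON, for illustration only.
text = '{"titleText":"99) Raging Bull","id":1},{"titleText":"98) Amelie","id":2}'

# Lazy: (.*?) stops at the first '",' after each "titleText".
lazy = re.findall(r'"titleText":"(.*?)",', text)

# Greedy: (.*) runs on to the last '",' in the string, swallowing both entries.
greedy = re.findall(r'"titleText":"(.*)",', text)

print(lazy)
print(greedy)
```

The lazy version returns one clean title per match, while the greedy version returns a single long match that spans everything between the first "titleText" and the last `",`.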
