
For a simple BeautifulSoup experiment, I am trying to scrape some data from an IMDb page: https://www.imdb.com/title/tt7069210/

The problem is that I cannot get the elements with class rec_item. I have tried many selectors to get hold of them, but each one returns an empty list.

Here is why I find this strange:

  • The elements with class rec_item are not inside any iframe.
  • The elements are visible when I use View Page Source in the browser. Therefore, as I understand it, they are NOT loaded by JavaScript after page load.
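
One way to test that assumption is to inspect the HTML that requests actually receives rather than what the browser shows. The following is only a minimal sketch: the helper name count_matches and the two sample HTML strings are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup


def count_matches(html, selector):
    """Return how many elements in `html` match the CSS `selector`."""
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.select(selector))


# Hypothetical stand-ins for two different responses to the same URL.
browser_like_html = '<div class="rec_item"><img title="Movie A"></div>'
script_like_html = '<div id="titleStoryLine"></div>'

print(count_matches(browser_like_html, '.rec_item'))  # 1
print(count_matches(script_like_html, '.rec_item'))   # 0

# Against the real page the same check would be (needs network access):
#   res = requests.get('https://www.imdb.com/title/tt7069210/')
#   print('rec_item' in res.text, count_matches(res.text, '.rec_item'))
```

If the count is 0 for res.text, the markup never reached the script, so no selector can find it, whatever View Page Source shows in the browser.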

Here is the repl.it link of the code

Question: Can anyone please help me understand why the list of rec_item elements is empty?

Additional Information

Here is the code:

from bs4 import BeautifulSoup
import requests


def extract(url):
    res = requests.get(url)
    bsoup = BeautifulSoup(res.text, 'html.parser')
    the_title = bsoup.select('meta[name="title"]')[0].attrs['content']
    print('Title: ' + the_title)    # This works fine

    long_text = bsoup.select('#titleStoryLine .inline.canwrap span')[0].string.strip()
    print('Description: ' + long_text)    # this too works fine

    similar_movies = bsoup.select('.rec_item')
    print(similar_movies)   # blank array :(


extract('https://www.imdb.com/title/tt7069210/')

[Screenshot: browser's View Page Source]

And here is the output from repl.it: [Screenshot: code output from repl.it]

Suman Barick
  • Recommendations are dynamically loaded with JavaScript (they are not in the body you download). You can't do it with requests; try it with Selenium. – Lucas May 17 '21 at 15:37
  • @Lucas But they are available in "view page source", and that is what makes it mysterious to me. As the accepted answer at this SO link (below) says: "View Source in the browser shows you the original HTML source of the page - exactly what came from the server before any client side modifications have been made. As such, it will NOT include any dynamic changes to the page made by javascript." [https://stackoverflow.com/questions/25215813/can-new-elements-inserted-with-javascript-be-seen-with-view-source#:~:text=View%20Source%20in%20the%20browser,the%20page%20made%20by%20javascript.] – Suman Barick May 17 '21 at 16:03

1 Answer


You have to add headers (a browser-like User-Agent) to get proper HTML and not some third-grade bot-wannabe hypertext.

Here's how to get this done:

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36"
}


def extract(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    the_title = soup.select('meta[name="title"]')[0].attrs['content']
    print('Title: ' + the_title)  # This works fine

    long_text = soup.select('#titleStoryLine .inline.canwrap span')[0].string.strip()
    print('Description: ' + long_text)  # this too works fine

    similar_movies = soup.select('.rec_item img')
    print([i["title"] for i in similar_movies])  # works now :)


extract('https://www.imdb.com/title/tt7069210/')

Output:

Title: The Conjuring 3: The Devil Made Me Do It (2021) - IMDb
Description: A chilling story of terror, murder and unknown evil that shocked even experienced real-life paranormal investigators Ed and Lorraine Warren. One of the most sensational cases from their files, it starts with a fight for the soul of a young boy, then takes them beyond anything they'd ever seen before, to mark the first time in U.S. history that a murder suspect would claim demonic possession as a defense.
['The Conjuring 2', 'The Conjuring 2 Remake', 'The Conjuring', 'The Maiden', 'Conjuring the Devil', 'Billie Eilish: Bury a Friend', 'Oxygen', 'The Curse of La Llorona', 'Annabelle Comes Home', 'Shang-Chi and the Legend of the Ten Rings', 'Malignant', 'The Nun']
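
For context on why the header makes a difference: when no User-Agent is supplied, requests identifies itself as python-requests/<version>, which a server can easily tell apart from a browser. That IMDb serves different markup on that basis is an inference from the result above, not something documented here. The sketch below prints the default value and sets the header once on a requests.Session; the helper name make_session is hypothetical, and the User-Agent string is a shortened version of the one used above.

```python
import requests

# What requests sends when no User-Agent is supplied.
print(requests.utils.default_user_agent())  # e.g. python-requests/<version>

# Shortened, browser-like User-Agent string (illustrative value).
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.86 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Return a Session that sends `user_agent` with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()
print(session.headers["User-Agent"])

# session.get(url) would now send the browser-like header automatically.
```

Using a session keeps the header in one place if the script goes on to fetch several title pages.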
baduker
  • "not some third grade bot wannabe hypertext" :D Thanks a lot, Sir :). So the server was stripping some HTML away because it sensed that the request was not from a real browser? Can you point me to a doc/URL where I can learn more about this behavior, please? That would be a great help. Thanks again, man ... – Suman Barick May 18 '21 at 05:25