I need to extract a list of all the genres of any given movie from the movie page on IMDb.

I tried using Beautiful Soup but I am not able to find the exact class under which the genres are stored.

Following are the snippets I tried:

ul= soup.find("ul", {"class": "ipc-metadata-list ipc-metadata-list--dividers-all sc-388740f9-1 IjgYL ipc-metadata-list--base"})
children = ul.findChildren("a", recursive=False)

This throws an error saying AttributeError: 'NoneType' object has no attribute 'findChildren'

class_selector = "ipc-inline-list__item" 
genre = soup.find_all('li', {'class': class_selector})
list1 = []
for tag in list1:
   list1.append(tag.find('a')).text
print(list1)

This returns a list with no entries.

Any help would be great!

Image of the website source code

3 Answers

You probably don't need the overhead of a selenium/chromedriver setup; requests is enough:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.imdb.com/title/tt0454848/?ref_=adv_li_i')
soup = BeautifulSoup(r.text, 'html.parser')
genres = soup.select_one('div.ipc-chip-list__scroller')
for genre in genres.contents:
    print(genre.text)

This prints out:

Crime
Drama
Mystery

BeautifulSoup documentation can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

UPDATE: To get the full genre list from further down the page, you can use selenium on its own. The full code is below:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://www.imdb.com/title/tt0454848/?ref_=adv_li_i'
# url = 'https://www.imdb.com/title/tt0765429/'

browser.get(url)
# Scroll down so the lazily loaded storyline section is requested
browser.execute_script("window.scrollBy(0,2200);")
# Wait (up to 20 seconds) for the storyline block that is filled in by the graphql call
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//div[@data-testid="storyline-plot-summary"]')))
# Collect every <li> that sits under the "Genres" label
genres = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//span[text()='Genres']/following-sibling::div//child::li")))
for g in genres:
    print(g.text)

This will print out:

Crime
Drama
Mystery
Thriller

This solution uses selenium only, and waits as long as needed (up to 20 seconds) for the data to be loaded into the page by the graphql query.
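To make the "wait up to 20 seconds" behaviour concrete, here is a minimal sketch of the polling idea behind an explicit wait. It is a simplified illustration, not selenium's actual implementation: wait_until, genres_loaded and the attempts counter are hypothetical names, and the fake condition simply pretends the genre list shows up on the third check.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose genre list appears on the third poll.
attempts = {"count": 0}


def genres_loaded():
    attempts["count"] += 1
    if attempts["count"] < 3:
        return []  # nothing rendered yet
    return ["Crime", "Drama", "Mystery", "Thriller"]


genres = wait_until(genres_loaded, timeout=2.0, poll_interval=0.01)
print(genres)
```

The point is that the wait returns as soon as the condition holds, so the 20 seconds is only an upper bound and not a fixed delay.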

Barry the Platipus
  • Thanks for the prompt reply. I tried running the code you sent but for some reason the list of genres is capped to 3 elements for all the movies I am trying. Is there any reason for the same? Is it possible to get a list of all genres? – Irish Mehta Jul 31 '22 at 22:19
  • Can you give an example with more than 3 genres, where you only get 3? – Barry the Platipus Jul 31 '22 at 22:21
  • For the attached movie itself (Inside Man), there are 4 genres, Crime, Drama, Mystery and Thriller, but I only see Crime, Drama and Mystery printed on the console – Irish Mehta Jul 31 '22 at 22:27
  • No, there are only 3: https://ibb.co/9sQKCCX Crime, Drama, Mystery – Barry the Platipus Jul 31 '22 at 22:30
  • Oh actually this was not the place I wanted to extract the data from. If you scroll down on the same page, there is a list of all the genres that the movie belongs to, as shown in this image https://i.stack.imgur.com/8xxPA.png – Irish Mehta Jul 31 '22 at 22:35
  • That data is being loaded by javascript, via a graphql call to https://caching.graphql.imdb.com/?operationName=TMD_Storyline&variables=%7B%22titleId%22%3A%22tt0454848%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22sha256Hash%22%3A%22cbefc9c4a2dbd0a5583e223e5bc788946016db709a731c85251fc1b1b7a1afbe%22%2C%22version%22%3A1%7D%7D. You would need to either scrape that or use the API offered for free by IMDb. – Barry the Platipus Jul 31 '22 at 22:47
  • I added a (pure) selenium solution as well. This will wait for the data to be loaded in page (up to 20 seconds, if needed) and print out the genres. – Barry the Platipus Aug 01 '22 at 00:17
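For anyone who wants to try the graphql route mentioned in the comments, the URL quoted there can be rebuilt for any title ID with the standard library alone. This is only a sketch: build_storyline_url is a hypothetical helper, the persisted-query hash is copied from that comment and may have changed since, and the shape of the response is not shown in this thread, so the code only constructs the URL and does not send a request.

```python
import json
from urllib.parse import urlencode

GRAPHQL_ENDPOINT = "https://caching.graphql.imdb.com/"
# Persisted-query hash copied from the URL quoted in the comments; it may no longer be valid.
STORYLINE_HASH = "cbefc9c4a2dbd0a5583e223e5bc788946016db709a731c85251fc1b1b7a1afbe"


def build_storyline_url(title_id):
    """Return the TMD_Storyline query URL for an IMDb title ID such as 'tt0454848'."""
    params = {
        "operationName": "TMD_Storyline",
        # Compact JSON (no spaces) so the encoded URL matches the one seen in the browser
        "variables": json.dumps({"titleId": title_id}, separators=(",", ":")),
        "extensions": json.dumps(
            {"persistedQuery": {"sha256Hash": STORYLINE_HASH, "version": 1}},
            separators=(",", ":"),
        ),
    }
    return GRAPHQL_ENDPOINT + "?" + urlencode(params)


print(build_storyline_url("tt0454848"))
```

For tt0454848 this reproduces the URL from the comment character for character; swapping in another title ID only changes the variables part.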

Based on your screenshot, you can get the list of genres using selenium with bs4 as follows:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")


webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url='https://www.imdb.com/title/tt0454848/?ref_=adv_li_i'
driver.get(url)
driver.maximize_window()
time.sleep(5)

soup = BeautifulSoup(driver.page_source,'lxml')
t=soup.select_one('span.ipc-metadata-list-item__label:-soup-contains("Genres")').parent
genre=[x.get_text() for x in t.select('div[class="ipc-metadata-list-item__content-container"] > ul > li')]
print(genre)

Output:

['Crime', 'Drama', 'Mystery', 'Thriller']
Md. Fazlul Hoque

To extract the list of all genres (Crime, Drama, Mystery and Thriller), you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR and get_attribute("innerHTML"):

    driver.get('https://www.imdb.com/title/tt0454848/?ref_=adv_li_i')
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li[data-testid='storyline-genres'] a.ipc-metadata-list-item__list-content-item.ipc-metadata-list-item__list-content-item--link")))])
    
  • Using XPATH and text attribute:

    driver.get('https://www.imdb.com/title/tt0454848/?ref_=adv_li_i')
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//li[@data-testid='storyline-genres']//a[@class='ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link']")))])
    
  • Console Output:

    ['Crime', 'Drama', 'Mystery', 'Thriller']
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
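One practical difference between the two strategies: innerHTML returns the raw markup inside the element, so HTML entities and stray whitespace can come through, whereas text returns the rendered text. If you go the innerHTML route, a small clean-up step may help. The sample values below are hypothetical, not taken from IMDb:

```python
from html import unescape

# Hypothetical innerHTML values, as get_attribute("innerHTML") might return them
raw_values = ["Crime", " Drama ", "Action &amp; Adventure"]

# Decode HTML entities and trim surrounding whitespace
genres = [unescape(value).strip() for value in raw_values]
print(genres)
```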
    
undetected Selenium
  • Thank you for replying! I'm able to extract the top 3 genres from the methods you suggest but I'm looking for an exhaustive list of genres as mentioned when you scroll down on the same page. The exhaustive list is mentioned in this part of the page i.stack.imgur.com/8xxPA.png – Irish Mehta Jul 31 '22 at 22:40
  • Although your code does not give me the list of Genres, it does give me the list of keywords which is another data point I need. Thanks a lot! – Irish Mehta Aug 01 '22 at 17:02
  • Hmm, I don't visit urls really because as per SO standards we expect OP to provide the text based HTML, so contributors can test their code before publishing as answers. Updated the answer. – undetected Selenium Aug 01 '22 at 19:48
  • Will keep that in mind before asking another question on SO! – Irish Mehta Aug 02 '22 at 20:03