I am trying to scrape images from IMDb, but I am unable to get their URLs. IMDb lazy-loads its images ("load late"), so the real image URLs are not in the page when I read it, and I do not know how to proceed. Can you please help me?

from bs4 import BeautifulSoup
import requests
import urllib
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait



mimg = []

imdb_link = "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&count=250"
opts = Options()
opts.add_argument("--headless")
# Raw strings keep the backslashes in Windows paths from being read as escapes
opts.binary_location = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
chrome_driver = r'C:\Project\chromedriver.exe'
driver = webdriver.Chrome(options=opts, executable_path=chrome_driver)
element = WebDriverWait(driver, 3)
driver.get(imdb_link)
time.sleep(2)

rmsoup = driver.page_source
time.sleep(2)
time.sleep(2)
time.sleep(2)
relsoup = BeautifulSoup(rmsoup, features='lxml')
driver.close()

for img in relsoup.findAll('img'):
    mimg.append(img.get('src'))
print(mimg)
  • Refer to this: https://stackoverflow.com/questions/59130200/selenium-wait-until-element-is-present-visible-and-interactable – Keval Mar 24 '21 at 09:19

1 Answer

import time

from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # needed for PAGE_DOWN

target_url = "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&count=250"

c_options = Options()
c_options.add_argument("--start-maximized")
browser = Chrome(executable_path='chromedriver.exe', options=c_options)
browser.get(target_url)
# Scroll down the page so that the lazily loaded images get their real URLs.
# Sending PAGE_DOWN keys is a crude method, but for now I didn't want to go through JavaScript.
body = browser.find_element(By.CSS_SELECTOR, 'body')
for _ in range(101):
    body.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)  # give the images time to load after each scroll
# Soup Logic    
img_links = []
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
advanced_div = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})
for div in advanced_div:
    img = div.find('img')
    link = img['src']
    img_links.append(link)
print(img_links)

This works for me. Does it help you?
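The PAGE_DOWN loop in the code above could probably be replaced by scrolling through JavaScript. The following is only a sketch of that idea and has not been tested against IMDb: `scroll_through_page`, `pause`, and `max_steps` are hypothetical names, and the only Selenium call it relies on is `execute_script`. It scrolls one viewport at a time, on the assumption that each image has to come into view before its real URL is loaded, and stops once the bottom of the page is reached.

```python
import time


def scroll_through_page(driver, pause=0.5, max_steps=500):
    """Scroll one viewport at a time until the bottom of the page is reached.

    `driver` is any object with an `execute_script` method, such as a
    Selenium WebDriver. Returns the number of scroll steps performed.
    """
    steps = 0
    for steps in range(1, max_steps + 1):
        # Move down by the height of the visible window.
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        # Give lazily loaded images time to load (the pause length is a guess).
        time.sleep(pause)
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.documentElement.scrollHeight - 1;"
        )
        if at_bottom:
            break
    return steps
```

With the `browser` object from the code above, this would be called as `scroll_through_page(browser)` before reading `browser.page_source`. The `max_steps` limit is there so that a page that keeps growing cannot keep the loop running forever.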

nkpydev
  • It only gives the first two image URLs as .jpg files, which open; the others are .png files, which do not open because of load-late. – Prabhat Rai Mar 24 '21 at 11:10
  • I get the point now about the "load-late" part. Basically, the page needs to be scrolled down for each image to load properly in ".jpg" format. I am changing the code in the answer above to do the scrolling, though with the very crude method of sending PAGE_DOWN keys. Ideally, JavaScript would do this more elegantly. – nkpydev Mar 24 '21 at 11:44
  • @nkpydev you should include those details in your answer as well! Adding explanations to your code will greatly improve the effectiveness of your answer. – cjnash Mar 24 '21 at 14:09
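
The comments above suggest that the .png URLs are placeholders and that the real .jpg URL only ends up in `src` after the image has loaded. If the page keeps the real URL in a separate attribute until then, it may be possible to read it without scrolling at all. The sketch below assumes such an attribute exists and is called `loadlate`; that name is an assumption based on the "load-late" wording in this thread and should be checked against the actual page source. `pick_image_url` and `collect_image_urls` are hypothetical helper names.

```python
def pick_image_url(img, lazy_attr="loadlate"):
    """Return the lazily loaded URL if present, otherwise fall back to src.

    `img` is anything with a dict-style `.get` method, for example a
    BeautifulSoup tag. `lazy_attr` is an assumed attribute name.
    """
    if img is None:
        return None
    return img.get(lazy_attr) or img.get("src")


def collect_image_urls(imgs, lazy_attr="loadlate"):
    """Collect one URL per image, skipping images that have none."""
    urls = []
    for img in imgs:
        url = pick_image_url(img, lazy_attr)
        if url:
            urls.append(url)
    return urls
```

With the `soup` object from the answer, `collect_image_urls(soup.findAll('img'))` would return the list of URLs. Images that have no `loadlate` attribute fall back to `src`, so the result is never worse than reading `src` alone.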