I want to scrape data from IMDb. To do this for multiple pages I have used the click() method of the Selenium package.

Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

pages = [str(i) for i in range(10)]

#getting url for each page and year:
url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'
driver = webdriver.Chrome(r"C:\Users\yefida\Desktop\Study_folder\Online_Courses\The Complete Python Course\Project 2 - Quotes Webscraping\chromedriver.exe")
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

for page in pages:
    data = soup.find_all('div', class_='lister-item mode-advanced')
    data_list = []
    for item in data:
        temp = {}
        # Name of movie
        temp['movie'] = item.h3.a.text
        # Year
        temp['year'] = item.find('span', {'class': 'lister-item-year text-muted unbold'}).text.replace('(', '').replace(')', '').replace('I', '').replace('–', '')
        # Runtime in minutes
        temp['time'] = item.find('span', {'class': 'runtime'}).text.replace(' min', '')
        # Genre
        temp['genre'] = item.find('span', {'class': 'genre'}).text.replace(' ', '').replace('\n', '')
        # Rating of users
        temp['rating'] = item.find('div', {'class': 'inline-block ratings-imdb-rating'}).text.replace('\n', '').replace(',', '.')
        # Metascore
        try:
            temp['metascore'] = item.find('div', {'class': 'inline-block ratings-metascore'}).text.replace('\n', '').replace('Metascore', '').replace(' ', '')
        except:
            temp['metascore'] = None
        data_list.append(temp)

    # next page
    continue_link = driver.find_element_by_link_text('Next')
    continue_link.click()

At the end I am getting an error:

Message: no such element: Unable to locate element: {"method":"link text","selector":"Next"}
  (Session info: chrome=70.0.3538.102)

Can you help me correct it?


3 Answers

That's because the link text is actually "Next »", so try either

continue_link = driver.find_element_by_link_text('Next »')

or

continue_link = driver.find_element_by_partial_link_text('Next')
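
Note also that soup is parsed only once, before the loop, so every iteration scrapes page 1 again. A minimal sketch of the corrected loop, assuming the partial-link-text variant and re-parsing the page source on each pass:

data_list = []  # moved above the loop so results accumulate across pages
for page in pages:
    soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the page currently loaded
    data = soup.find_all('div', class_='lister-item mode-advanced')
    # ... build temp for each item and append to data_list, as in the question ...
    continue_link = driver.find_element_by_partial_link_text('Next')
    continue_link.click()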
JaSON

You could also use a CSS selector to target the class of the next page button:

driver.find_element_by_css_selector('.lister-page-next.next-page').click()

This class is consistent across pages. You could add a wait for the element to be clickable:

WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))
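
The wait requires these imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By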

My understanding is that CSS selector should be a fast matching method. Some benchmarks here.
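Putting it together, a minimal sketch of the pagination loop (assuming ten pages, as in the question, and that you re-parse the page source after each click):

for _ in range(10):
    soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the current page
    # ... scrape soup as in the question ...
    next_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))
    next_button.click()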

QHarr
  • Thank you! It works, but not exactly what I want: It scrapes only the first page out of 10. – DY92 Nov 23 '18 at 20:51
  • The class remains the same across pages so this should work if used on each new page. You could add wait for element to become clickable. – QHarr Nov 23 '18 at 20:54

Following the logic below, you can update your soup element with the new page content. I used the xpath '//a[contains(.,"Next")]' to click on the next page button. The script keeps clicking the next page button until there is no more button to click, then breaks out of the loop. Give it a go:

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")

while True:
    items = [itm.get_text(strip=True) for itm in soup.select('.lister-item-content a[href^="/title/"]')]
    print(items)

    try:
        driver.find_element_by_xpath('//a[contains(.,"Next")]').click()
        soup = BeautifulSoup(driver.page_source, "lxml")
    except Exception:
        break
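
Note that driver.page_source is read immediately after the click; if the next page loads slowly, the same page may be parsed twice. Adding an explicit wait after the click (such as the WebDriverWait shown in the answer above) makes the re-parse more reliable.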
SIM