I want to scrape data from IMDb. To do this for multiple pages I have used the click() method of the Selenium package.

Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

pages = [str(i) for i in range(10)]

#getting url for each page and year:
url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'
driver = webdriver.Chrome(r"C:\Users\yefida\Desktop\Study_folder\Online_Courses\The Complete Python Course\Project 2 - Quotes Webscraping\chromedriver.exe")
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

for page in pages:
    data = soup.find_all('div', class_='lister-item mode-advanced')
    data_list = []
    for item in data:
        temp = {}
        # Name of movie
        temp['movie'] = item.h3.a.text
        # Year
        temp['year'] = item.find('span', {'class': 'lister-item-year text-muted unbold'}).text.replace('(', '').replace(')', '').replace('I', '').replace('–', '')
        # Runtime in minutes
        temp['time'] = item.find('span', {'class': 'runtime'}).text.replace(' min', '')
        # Genre
        temp['genre'] = item.find('span', {'class': 'genre'}).text.replace(' ', '').replace('\n', '')
        # Rating of users
        temp['rating'] = item.find('div', {'class': 'inline-block ratings-imdb-rating'}).text.replace('\n', '').replace(',', '.')
        # Metascore
        try:
            temp['metascore'] = item.find('div', {'class': 'inline-block ratings-metascore'}).text.replace('\n', '').replace('Metascore', '').replace(' ', '')
        except:
            temp['metascore'] = None
        data_list.append(temp)

    # next page
    continue_link = driver.find_element_by_link_text('Next')
    continue_link.click()

At the end I am getting an error:

Message: no such element: Unable to locate element: {"method":"link text","selector":"Next"}
  (Session info: chrome=70.0.3538.102)

Can you help me correct it?


3 Answers

That's because the link text is actually "Next »", so try either

continue_link = driver.find_element_by_link_text('Next »')

or

continue_link = driver.find_element_by_partial_link_text('Next')
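
Note also that soup is parsed only once, before the loop, so every iteration scrapes page 1 again. A minimal sketch of the corrected loop, assuming the partial-link-text variant and re-parsing the page source on each pass:

data_list = []  # moved above the loop so results accumulate across pages
for page in pages:
    soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the page currently loaded
    data = soup.find_all('div', class_='lister-item mode-advanced')
    # ... build temp for each item and append to data_list, as in the question ...
    continue_link = driver.find_element_by_partial_link_text('Next')
    continue_link.click()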
JaSON

You could also use a CSS selector to target the class of the next page button:

driver.find_element_by_css_selector('.lister-page-next.next-page').click()

This class is consistent across pages. You could add a wait for the element to be clickable:

WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))
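
The wait requires these imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By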

My understanding is that CSS selector should be a fast matching method. Some benchmarks here.
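Putting it together, a minimal sketch of the pagination loop (assuming ten pages, as in the question, and that you re-parse the page source after each click):

for _ in range(10):
    soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the current page
    # ... scrape soup as in the question ...
    next_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))
    next_button.click()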

QHarr
  • Thank you! It works, but not exactly what I want: It scrapes only the first page out of 10. – DY92 Nov 23 '18 at 20:51
  • The class remains the same across pages so this should work if used on each new page. You could add wait for element to become clickable. – QHarr Nov 23 '18 at 20:54

Following the logic below, you can update your soup element with the new page content. I used the xpath '//a[contains(.,"Next")]' to click on the next page button. The script keeps clicking the next page button until there is no more button to click, then breaks out of the loop. Give it a go:

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")

while True:
    items = [itm.get_text(strip=True) for itm in soup.select('.lister-item-content a[href^="/title/"]')]
    print(items)

    try:
        driver.find_element_by_xpath('//a[contains(.,"Next")]').click()
        soup = BeautifulSoup(driver.page_source, "lxml")
    except Exception:
        break
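
Note that driver.page_source is read immediately after the click; if the next page loads slowly, the same page may be parsed twice. Adding an explicit wait after the click (such as the WebDriverWait shown in the answer above) makes the re-parse more reliable.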
SIM