
I am scraping this website https://www.dccomics.com/comics

If you scroll all the way down you will find a browse comics section with pagination.

I would like to scrape all 25 comics from pages 1-5

This is the code I currently have:

from selenium import webdriver
from bs4 import BeautifulSoup 
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


class Scraper():
    comics_url = "https://www.dccomics.com/comics"
    driver = webdriver.Chrome("C:\\laragon\\www\\Proftaak\\chromedriver.exe")
    # driver = webdriver.Chrome("C:\\laragon\\www\\proftaak-2020\\Proftaak-scraper\\chromedriver.exe")
    driver.get(comics_url)
    driver.implicitly_wait(500)
    current_page = 2

    def GoToComic(self):
        for i in range(1, 26):
            time.sleep(2)
            goToComic = self.driver.find_element_by_xpath(f'//*[@id="dcbrowseapp"]/div/div/div/div[3]/div[3]/div[2]/div[{i}]/a/img')
            self.driver.execute_script("arguments[0].click();", goToComic)
            self.ScrapeComic()
            self.driver.back()
            self.ClearFilter()
            if self.current_page != 6:
                if i == 25:
                    self.current_page += 1
                    self.ToNextPage()

    def ScrapeComic(self):
        self.driver.implicitly_wait(250)
        title = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'page-title')]")))]
        price = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'buy-container-price')]/span[contains(@class, 'price')]")))]
        available = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'sale-status-container')]/span[contains(@class, 'sale-status')]")))]
        try:
            description =  [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "field-items")))]
        except:
            return

    def ToNextPage(self):
        if self.current_page != 6:
            nextPage = self.driver.find_element_by_xpath(f'//*[@id="dcbrowseapp"]/div/div/div/div[3]/div[3]/div[3]/div[1]/ul/li[{self.current_page}]/a')
            self.driver.execute_script("arguments[0].click();", nextPage)
            self.GoToComic()

    def AcceptCookies(self):
        self.driver.implicitly_wait(250)
        cookies = self.driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[4]/div[2]/div/button')
        self.driver.execute_script("arguments[0].click();", cookies)
        self.driver.implicitly_wait(100)

    def ClearFilter(self):
        self.driver.implicitly_wait(500)
        clear_filter = self.driver.find_element_by_class_name('clear-all-action')
        self.driver.execute_script("arguments[0].click();", clear_filter)

    def QuitDriver(self):
        self.driver.quit()

scraper = Scraper()

scraper.AcceptCookies()
scraper.ClearFilter()
scraper.GoToComic()
scraper.QuitDriver()

Now it scrapes the first 25 comics of the first page fine, but the problem arises when I go to the second page: it scrapes the first comic of page 2 fine, but when I navigate back from that comic, the filter is reset and scraping starts at page 1 again.

How could I make it either resume from the correct page, or ensure the filter is always cleared before going back to the comics page? I tried using something like sessions/cookies, but it seems the filter is not saved anywhere.

niels van hoof

2 Answers


The browse comics section within the webpage https://www.dccomics.com/comics doesn't have 5 pages of pagination but only 3 pages in total. To scrape the names of each comic using Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategies:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException, ElementClickInterceptedException
    import time
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.dccomics.com/comics')
    while True:
        try:
            time.sleep(5)
            print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'browse-result')]/a//p[not(contains(@class, 'result-date'))]")))])
            WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='pagination']//li[@class='active']//following::li[1]/a"))).click()
            print("Navigating to the next page")
        except (TimeoutException, ElementClickInterceptedException):
            print("No more pages to browse")
            break
    driver.quit()
    
  • Console Output:

    ['PRIMER', 'DOOMSDAY CLOCK PART 2', 'CATWOMAN #22', 'ACTION COMICS #1022', 'BATMAN/SUPERMAN #9', 'BATMAN: GOTHAM NIGHTS #7', 'BATMAN: THE ADVENTURES CONTINUE #5', 'BIRDS OF PREY #1', 'CATWOMAN 80TH ANNIVERSARY 100-PAGE SUPER SPECTACULAR #1', 'DC GOES TO WAR', "DCEASED: HOPE AT WORLD'S END #2", 'DETECTIVE COMICS #1022', 'FAR SECTOR #6', "HARLEY QUINN: MAKE 'EM LAUGH #1", 'HOUSE OF WHISPERS #21', 'JOHN CONSTANTINE: HELLBLAZER #6', 'JUSTICE LEAGUE DARK #22', 'MARTIAN MANHUNTER: IDENTITY', 'SCOOBY-DOO, WHERE ARE YOU? #104', 'SHAZAM! #12', 'TEEN TITANS GO! TO CAMP #15', 'THE JOKER: 80 YEARS OF THE CLOWN PRINCE OF CRIME THE DELUXE EDITION', 'THE LAST GOD: TALES FROM THE BOOK OF AGES #1', 'THE TERRIFICS VOL. 3: THE GOD GAME', 'WONDER WOMAN #756']
    Navigating to the next page
    ['YOUNG JUSTICE VOL. 2: LOST IN THE MULTIVERSE', 'AMETHYST #3', 'BATMAN #92', 'DC CLASSICS: THE BATMAN ADVENTURES #1', 'DC COMICS: THE ASTONISHING ART OF AMANDA CONNER', 'DIAL H FOR HERO VOL. 2: NEW HEROES OF METROPOLIS', 'HARLEY QUINN #73', "HARLEY QUINN: MAKE 'EM LAUGH #2", 'JUSTICE LEAGUE #46', 'JUSTICE LEAGUE ODYSSEY #21', 'LEGION OF SUPER-HEROES #6', 'LOIS LANE #11', 'NIGHTWING #71', 'TEEN TITANS GO! TO CAMP #16', "THE BATMAN'S GRAVE #7", 'THE FLASH #755', 'THE FLASH VOL. 12: DEATH AND THE SPEED FORCE', 'THE JOKER 80TH ANNIVERSARY 100-PAGE SUPER SPECTACULAR #1', 'YEAR OF THE VILLAIN: HELL ARISEN', 'YOUNG JUSTICE #15', 'SUPERMAN #22', 'BATMAN SECRET FILES #3', 'WONDER WOMAN: TEMPEST TOSSED', 'HAWKMAN #24', 'JOKER: THE DELUXE EDITION']
    Navigating to the next page
    ['METAL MEN #7', 'NIGHTWING ANNUAL #3', 'BATGIRL VOL. 7: ORACLE RISING', 'BATMAN & THE OUTSIDERS #13', 'BATMAN: GOTHAM NIGHTS #9', 'CATWOMAN VOL. 3: FRIEND OR FOE?', 'DAPHNE BYRNE #5', "DCEASED: HOPE AT WORLD'S END #3", 'STRANGE ADVENTURES #2', 'THE FLASH ANNUAL (REBIRTH) #3', 'THE GREEN LANTERN SEASON TWO #4', 'THE QUESTION: THE DEATHS OF VIC SAGE #3', 'WONDER WOMAN #757', 'WONDER WOMAN: AGENT OF PEACE #6', 'WONDER WOMAN: DEAD EARTH #3', 'DARK NIGHTS: DEATH METAL #1', 'YOU BROUGHT ME THE OCEAN']
    No more pages to browse
    
undetected Selenium
  • Yes, this does seem to scrape the pagination and titles correctly, but would this still work when I click on a comic, scrape the price, status and description, go back to the web page, and continue scraping each comic separately until I've scraped all 25 comics, then proceed to the next page to repeat the process? – niels van hoof Jun 15 '20 at 15:42
  • @nielsvanhoof That's achievable but that would need additional lines of code along with a more structured logic. – undetected Selenium Jun 15 '20 at 21:54

Browser back functionality takes you to the previously visited URL. On the website you mention, a single URL is used for all the pages (they appear to be loaded by JavaScript into the same page, so no new URL is created for each page of comics).

This is why, when you go back from the first comic of the second page, you simply reload https://www.dccomics.com/comics, where the first page is loaded by default.

I can also see that there is no dedicated control for getting back to the list from the comic details.

Hence the only way is to store the current page number somewhere in your code and switch back to that page once you have returned from a comic detail page.

Alexey R.