I have two separate Selenium scripts that each scrape a website and download a file. I am trying to merge them into one script and make them run simultaneously rather than sequentially. Can someone create a working script that merges the two so that they run in parallel?

Here is the first code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)

params = {'behavior': 'allow', 'downloadPath': os.getcwd()}
driver.execute_cdp_cmd('Page.setDownloadBehavior', params)

driver.get("https://www.ons.gov.uk/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys("Education and childcare")
driver.find_element(By.XPATH, '//*[@id="nav-search-submit"]').click()
driver.find_element(By.XPATH, '//*[@id="results"]/div[1]/div[2]/div[1]/h3/a/span').click()
driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div[1]/section/div/div[1]/div/div[2]/h3/a/span').click()
driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a').click()

and here is the second code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)

params = {'behavior': 'allow', 'downloadPath': os.getcwd()}
driver.execute_cdp_cmd('Page.setDownloadBehavior', params)

driver.get("https://data.gov.uk/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/main/div[2]/form/div/div/input"))).send_keys("Forestry Statistics 2018: Recreation")
driver.find_element(By.XPATH, '/html/body/div[3]/main/div[2]/form/div/div/div/button').click()
driver.find_element(By.XPATH, '/html/body/div[3]/form/main/div/div[2]/div[2]/div[2]/h2/a').click()
driver.find_element(By.XPATH, '/html/body/div[3]/main/div/div/div/section/table/tbody/tr[2]/td[1]/a').click()
  • You need to look into multithreading. Set up the two pieces of logic as functions and use this approach! https://stackoverflow.com/a/7207336/8039598 – Riko Hamblin Apr 23 '22 at 02:13
  • Is it possible to run this selenium script within scrapy? I read that scrapy is really good at multiprocessing. –  Apr 23 '22 at 20:15
  • Take a look at the pyppeteer library. It lets you control the browser asynchronously, so you can do work on two tabs at the same time; see the sketch after these comments. – irdkwmnsb Apr 25 '22 at 13:12
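As a rough illustration of that pyppeteer suggestion (not the accepted approach below), here is a minimal sketch that drives two tabs of one headless browser concurrently on a single asyncio event loop. The scrape coroutine and its body are placeholders; the question's actual search-and-download clicks would still need to be ported to pyppeteer's API.

import asyncio
from pyppeteer import launch

async def scrape(page, url):
    # Placeholder task: navigate and report the page title.
    # The real search/click/download steps would go here.
    await page.goto(url)
    return await page.title()

async def main():
    browser = await launch(headless=True)   # one browser...
    page1 = await browser.newPage()         # ...two tabs
    page2 = await browser.newPage()
    # Both coroutines make progress concurrently on the same event loop.
    titles = await asyncio.gather(
        scrape(page1, "https://www.ons.gov.uk/"),
        scrape(page2, "https://data.gov.uk/"),
    )
    print(titles)
    await browser.close()

asyncio.run(main())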

1 Answer


The simplest approach is to create a thread pool of size 2 (you do not need a multiprocessing pool, since each Chrome driver already runs in its own process):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

from multiprocessing.pool import ThreadPool
from functools import partial

def getDriver():
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1920,1080")

    driver = webdriver.Chrome(options=options)
    return driver

def task1():
    driver = getDriver()
    try:
        params = {'behavior': 'allow', 'downloadPath': os.getcwd()}
        driver.execute_cdp_cmd('Page.setDownloadBehavior', params)

        driver.get("https://www.ons.gov.uk/")
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys("Education and childcare")
        driver.find_element(By.XPATH, '//*[@id="nav-search-submit"]').click()
        driver.find_element(By.XPATH, '//*[@id="results"]/div[1]/div[2]/div[1]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div[1]/section/div/div[1]/div/div[2]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a').click()
    finally:
        driver.quit()

def task2():
    driver = getDriver()
    try:
        params = {'behavior': 'allow', 'downloadPath': os.getcwd()}
        driver.execute_cdp_cmd('Page.setDownloadBehavior', params)

        driver.get("https://data.gov.uk/")
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/main/div[2]/form/div/div/input"))).send_keys("Forestry Statistics 2018: Recreation")
        driver.find_element(By.XPATH, '/html/body/div[3]/main/div[2]/form/div/div/div/button').click()
        driver.find_element(By.XPATH, '/html/body/div[3]/form/main/div/div[2]/div[2]/div[2]/h2/a').click()
        driver.find_element(By.XPATH, '/html/body/div[3]/main/div/div/div/section/table/tbody/tr[2]/td[1]/a').click()
    finally:
        driver.quit()

def error_callback(task_name, e):
    print(f'{task_name} completed with exception {e}')

POOL_SIZE = 2 # We only need 2 for this case
pool = ThreadPool(POOL_SIZE)
pool.apply_async(task1, error_callback=partial(error_callback, 'task1'))
pool.apply_async(task2, error_callback=partial(error_callback, 'task2'))
# Wait for tasks to complete
pool.close()
pool.join()
– Booboo
  • Beautiful code. The only extra thing I did was to import time and add a short time.sleep(5) just before finally:, to give the files some time to finish downloading before quitting (a more robust alternative is sketched after these comments). Thank you very much for the code. I've never done multithreading before. Just out of curiosity: if I wanted to add more tasks, say n tasks, can I just add each one as another def, say def taskn()? Given, of course, that the tasks all work in the same way. –  Apr 25 '22 at 17:20
  • 1
    The short answer is, "Yes" -- just remember to increase the pool size to N if you have N tasks. I am on the way out for a while, but here are two different cases (1) Where all the tasks are the same except for the URL that is being "gotten", in which case you would want to use the `multiprocessing.pool.ThreadPool.map` method with a single worker function that receives the URL as an argument and (2) Where you have N tasks where N is very large but you want to limit the pool size to M < N so as to not have too many driver processes running at the same time *but* you want to reuse the M drivers. – Booboo Apr 25 '22 at 18:28
  • 1
    See [this question and answer](https://stackoverflow.com/questions/71500717/python-multiprocessing-gets-stuck-with-selenium) for an example of both (1) and (2). – Booboo Apr 25 '22 at 18:29
  • Thanks for the information. This will prove to be really useful. I'm new to multithreading and to Python in general, so some things may go over my head for now, although I'll be playing around with it a lot over the next few weeks and reading up on it too, so hopefully it will all become clearer with time. –  Apr 25 '22 at 22:27
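Following up on the download-timing comment above: instead of a fixed time.sleep(5), here is a sketch of a more robust wait, assuming Chrome's convention of writing in-progress downloads as *.crdownload files (wait_for_downloads is a hypothetical helper, not part of the answer):

import glob
import os
import time

def wait_for_downloads(directory, timeout=60):
    # Poll until Chrome has no in-progress (*.crdownload) files left in
    # the download directory, or until the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not glob.glob(os.path.join(directory, '*.crdownload')):
            return True
        time.sleep(0.5)
    return False

You would call wait_for_downloads(os.getcwd()) just before each finally: block. One caveat: a download that has not started yet leaves no .crdownload file, so a brief initial sleep after the final click may still be needed.

And here is a minimal sketch of the two cases described in the comments, under the stated assumptions: case (1) uses ThreadPool.map with a single worker parameterized by URL, and case (2) reuses one driver per pool thread via threading.local, so M drivers serve N tasks. scrape and get_thread_driver are hypothetical names; getDriver is the helper defined in the answer.

from multiprocessing.pool import ThreadPool
import threading

thread_local = threading.local()   # one driver slot per pool thread
all_drivers = []                   # so we can quit every driver at the end

def get_thread_driver():
    # Case (2): create at most one driver per pool thread and reuse it.
    driver = getattr(thread_local, 'driver', None)
    if driver is None:
        driver = getDriver()       # the helper defined in the answer
        thread_local.driver = driver
        all_drivers.append(driver)
    return driver

def scrape(url):
    # Case (1): a single worker function that receives the URL.
    driver = get_thread_driver()
    driver.get(url)
    return driver.title            # placeholder for the real scraping steps

urls = ["https://www.ons.gov.uk/", "https://data.gov.uk/"]  # N tasks
POOL_SIZE = 2                                               # M <= N drivers
pool = ThreadPool(POOL_SIZE)
results = pool.map(scrape, urls)   # blocks until all URLs are processed
pool.close()
pool.join()
for d in all_drivers:
    d.quit()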