
I'm running a webdriver with Selenium to scrape items from a page after sending keys and clicking a button. However, my data is fairly large (~28,000 rows) and the time it takes to complete a single page is ~0.8-1.3 seconds. Are there ways of improving the speed so it drops below 0.5 seconds per page? I have thought of using multiprocessing, but I'm inexperienced in that field.

Here's the most efficient version I've managed to create so far, but I'm pretty convinced it could go faster.

# Example data
df = pd.DataFrame({
    'salary': [28452, 28452, 31000, 35000, 35000],
    'tuition': [27750, 27750, 27750, 27750, 27750],
    'country': ['England'] * 5,
    'category': ['anatomy', 'physiology', 'finance', 'finance', 'finance'],
})

The imports:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from collections import defaultdict
import time
import timeit

These lists of XPaths (one entry per row of `df`) are used to loop the send_keys and clicks:

debt1 = []
salary1 = []
loan1 = []
plan21 = []
button1 = []

# Build one copy of each XPath per row of df.
for _ in range(len(df)):
    debt1.append("//input[@id='debt']")
    salary1.append("//input[@id='salary']")
    loan1.append("//select[@id='loan-type']")
    plan21.append("//select[@id='loan-type']/option[2]")
    button1.append("//button[@class='btn btn-primary calculate-button']")
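
As an aside, each of these lists just repeats the same XPath len(df) times, so plain constants would work equally well and skip the list-building:

# Equivalent constants; the XPaths never change between rows.
DEBT_XPATH = "//input[@id='debt']"
SALARY_XPATH = "//input[@id='salary']"
LOAN_XPATH = "//select[@id='loan-type']"
PLAN2_XPATH = "//select[@id='loan-type']/option[2]"
BUTTON_XPATH = "//button[@class='btn btn-primary calculate-button']"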

Finally, here's the selenium driver to scrape the information:

i = 0
tables_debt = defaultdict(list)
driver = webdriver.Chrome()

for salary, tuition, category, country, deb, sal, plan, lo, but in zip(
        df.salary, df.tuition, df.category, df.country,
        debt1, salary1, loan1, plan21, button1):
    start = timeit.default_timer()
    driver.get("https://www.student-loan-calculator.co.uk/")
    driver.find_element(By.XPATH, sal).clear()
    driver.find_element(By.XPATH, sal).send_keys(salary)
    driver.find_element(By.XPATH, deb).clear()
    driver.find_element(By.XPATH, deb).send_keys(str(tuition))
    driver.find_element(By.XPATH, lo).click()
    driver.find_element(By.XPATH, plan).click()
    driver.find_element(By.XPATH, but).click()
    driver2 = driver.page_source

    tables_debt['table'].append(pd.read_html(driver2))
    tables_debt['category'].append(category)
    tables_debt['country'].append(country)
    i += 1
    stop = timeit.default_timer()
    print(f"You're on this country: {country} and this row number {i}",
          'And the total time is:', stop - start)


Here's the output speed that I get. It starts off slow but settles around 0.8-1.3 seconds per page afterwards:

You're on this country: England and this row number 1 And the total time is: 2.415818831999786
You're on this country: England and this row number 2 And the total time is: 1.8458935059970827
You're on this country: England and this row number 3 And the total time is: 1.2618036500025482
You're on this country: England and this row number 4 And the total time is: 0.8504524469972239
You're on this country: England and this row number 5 And the total time is: 0.8366585959993245
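
Before reaching for threads, one option I've looked at is telling Chrome not to block on the full page load, since only the form matters. Here's a minimal sketch using Selenium's page load strategy; it's untested against this particular site, so I'm assuming the form is usable at DOMContentLoaded:

# Sketch: return control at DOMContentLoaded ('eager') instead of waiting
# for every sub-resource; 'none' waits even less but needs explicit waits.
options = webdriver.ChromeOptions()
options.page_load_strategy = 'eager'
options.add_argument("--headless")  # headless usually shaves off a little more
driver = webdriver.Chrome(options=options)

driver.get("https://www.student-loan-calculator.co.uk/")
# If the inputs aren't ready immediately, wait for them explicitly:
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//input[@id='salary']")))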

Using the approach from the answer linked in the comments below, here's what I have come up with:

from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading
import gc

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)
        print('The driver was just created.')

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        print('The driver has terminated.')


threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


def get_title(url):
    driver = create_driver()
    i = 0
    tables_debt = defaultdict(list)
    for salary, tuition, category, country, deb, sal, plan, lo, but in zip(
            df.salary, df.tuition, df.category, df.country,
            debt1, salary1, loan1, plan21, button1):
        start = timeit.default_timer()
        driver.get(url)
        driver.find_element(By.XPATH, sal).clear()
        driver.find_element(By.XPATH, sal).send_keys(salary)
        driver.find_element(By.XPATH, deb).clear()
        driver.find_element(By.XPATH, deb).send_keys(str(tuition))
        driver.find_element(By.XPATH, lo).click()
        driver.find_element(By.XPATH, plan).click()
        driver.find_element(By.XPATH, but).click()
        driver2 = driver.page_source
        tables_debt['table'].append(pd.read_html(driver2))
        tables_debt['category'].append(category)
        tables_debt['country'].append(country)
        i += 1
        stop = timeit.default_timer()
        print(f"You're on this country: {country} and this row number {i}",
              'And the total time is:', stop - start)



# 10 threads in our pool:
with ThreadPool(10) as pool:
    urls = [
        "https://www.student-loan-calculator.co.uk/"
    ]
    pool.map(get_title, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()
# pool.terminate() is called at exit of with block

However, the output is much the same speed as the original code:

The driver was just created.
You're on this country: England and this row number 1 And the total time is: 1.7789302960009081
You're on this country: England and this row number 2 And the total time is: 1.3455885630028206
You're on this country: England and this row number 3 And the total time is: 1.7194338539993623
You're on this country: England and this row number 4 And the total time is: 0.8721739090033225
You're on this country: England and this row number 5 And the total time is: 0.9608322030035197
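
As pointed out in the comments below, `urls` contains a single URL, so `pool.map` only ever invokes one worker and there is no real concurrency. Here's an untested sketch of what I understand the fix to be: split the rows of `df` across workers instead of the URLs, each thread reusing its own thread-local driver (`np.array_split` and the worker name `scrape_rows` are my own choices, not from the linked answer):

import numpy as np

def scrape_rows(rows):
    # Worker: each thread lazily creates, then reuses, its own driver.
    driver = create_driver()
    results = defaultdict(list)
    for row in rows.itertuples():
        driver.get("https://www.student-loan-calculator.co.uk/")
        driver.find_element(By.XPATH, "//input[@id='salary']").clear()
        driver.find_element(By.XPATH, "//input[@id='salary']").send_keys(str(row.salary))
        driver.find_element(By.XPATH, "//input[@id='debt']").clear()
        driver.find_element(By.XPATH, "//input[@id='debt']").send_keys(str(row.tuition))
        driver.find_element(By.XPATH, "//select[@id='loan-type']").click()
        driver.find_element(By.XPATH, "//select[@id='loan-type']/option[2]").click()
        driver.find_element(
            By.XPATH, "//button[@class='btn btn-primary calculate-button']").click()
        results['table'].append(pd.read_html(driver.page_source))
        results['category'].append(row.category)
        results['country'].append(row.country)
    return results

N_WORKERS = 4                           # one driver per worker thread
chunks = np.array_split(df, N_WORKERS)  # split the rows, not the URLs
with ThreadPool(N_WORKERS) as pool:
    all_results = pool.map(scrape_rows, chunks)
    # must be done before the pool is terminated:
    del threadLocal
    gc.collect()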
  • Does this answer your question? [How To Run Selenium-scrapy in parallel](https://stackoverflow.com/questions/66056697/how-to-run-selenium-scrapy-in-parallel). You can ignore the "scrapy" part of the question. The point is that you want to use multithreading, not multiprocessing, and this solution processes M pages (28000 in your case) with N drivers where N can be much less than M, so as not to create too many Selenium driver processes. – Booboo Jan 13 '22 at 15:09
  • @Booboo I'll update my post whilst implementing your answer from the link. Please let me know if I have to make any amendments to make it faster. It is slightly faster as I sometimes get scrapes at 0.69 seconds, but it also produces quite a few around 1.3-1.5 seconds. Perhaps there's an improvement that I have missed? I'm an amateur at this stuff so I'd really appreciate your support! I'll be using the result as a template to learn from for future scrapes. – me.limes Jan 13 '22 at 15:20
  • Your list of `urls` only contains a single URL so you will only be invoking your "worker function", `get_title`, once and therefore there will be no concurrency. Also, in your for loop, variables `debt1`, `salary1`, `loan1`, `plan21` and `button1` are not defined so I cannot fix this for you. And as an aside, why do you assign `driver.page_source` to a variable named `driver2`, which really is not the best name for page source? – Booboo Jan 13 '22 at 15:54
  • @Booboo The variables are defined somewhere further up the page, I created a list of the xpaths along with a reproducible example of my dictionary. Everything should run as is as it's all reproducible. How might I resolve the concurrency issue? There's no real reason! It was a quick fix to a problem I had before, but I just never changed the name. It seems that my post looks at improving the speed for a single URL rather than multiple. – me.limes Jan 13 '22 at 16:01
  • 1
    [This](https://ideone.com/nBGJAy) is how I would do it. It is, of course, untested -- just use the code and a starting point and try to figure out what is going on. Spend at least two years (just kidding -- but a long time reading the documentation and studying the code) before you even *think about* asking me a follow-up question . Thanks. – Booboo Jan 13 '22 at 16:47
  • @Booboo You GENIUS! I now have a very long homework assignment to understand :) – me.limes Jan 13 '22 at 17:22
  • I can be all wrong, but at these prices who can complain? – Booboo Jan 13 '22 at 17:54
