
I have a use case for which I'm unable to develop the logic, so I'm floating it here for recommendations from experts.

Quick context:
I have a list of 2,500 URLs. I am able to scrape them sequentially using Python and Selenium.
Run time for 1,000 URLs is approximately 1.5 hours.

What I am trying to achieve:
I am trying to reduce the run time through parallel execution. I have reviewed various posts on Stack Overflow, but somehow I am unable to find the missing pieces of the puzzle.

Details:

  1. I need to reuse the drivers instead of closing and reopening them for every URL. I came across a post, Python selenium multiprocessing, that leverages threading.local() (my understanding of that pattern is sketched below). Somehow the number of drivers that are opened exceeds the number of threads specified if I rerun the same code.

  2. Please note that the website requires the user to log in with a user name and password. My objective is to launch the drivers (say, 5 drivers) the first time and log in. I would like to keep reusing the same drivers for all future URLs without having to close them and log in again.

  3. Also, I am new to Selenium web scraping and just getting familiar with the basics. Multi-threading is uncharted territory, so I would really appreciate your help here.
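
For reference, my understanding of the threading.local() pattern from that post is the sketch below (one cached driver per thread); treat it as an illustration rather than my working code:

import threading

from selenium import webdriver

threadLocal = threading.local()

def get_driver():
    # Cache one driver per thread; create it only on that thread's first call
    driver = getattr(threadLocal, "driver", None)
    if driver is None:
        driver = webdriver.Chrome()
        threadLocal.driver = driver
    return driver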

Sharing my code snippet below:

import threading

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool


threadLocal = threading.local()


# Function to open web driver
def get_driver():
    options = Options()
    driver = webdriver.Chrome(<Location to chrome driver>, options=options)
    return driver


# Function to login to website & scrape from website
def parse_url(url):
    driver = get_driver()
    login_url = "https://..."
    driver.get(login_url)

    # Enter user ID
    # Enter password
    # Click on Login button

    # Open web page of interest & scrape
    driver.get(url)
    htmltext = driver.page_source
    htmltext1 = htmltext[0:100]
    return [url, htmltext1]
    

# Function for multi-threading
def main():
    urls = ["url1",
            "url2",
            "url3",
            "url4"]

    pool = ThreadPool(2)
    records = pool.map(parse_url, urls)
    pool.close()
    pool.join()
    
    return records


if __name__ == "__main__":
    result = pd.DataFrame(columns=["url", "html_text"], data=main())

How can I modify the above code such that:

  1. I end up reusing my drivers
  2. Log in to the website only once and scrape multiple URLs in parallel?
GhulKing
  • Hi mate - good question - I don't think reusing the open browsers is the best solution; that adds a level of complication when you'll have 5 open at once. As an alternative, how does your site authenticate? If it's cookies, you can log in once, use get_cookies to store your session in a variable, *then* kick off the 5x parallel execution - every time you get a new browser, set the cookies from your store. Potentially that means no more logging in, and you can navigate directly to your target URL (a sketch follows these comments). – RichEdwards Jul 17 '20 at 09:21
  • Did you find the solution? If so, could you please share it? Thank you! – Ajay Pyatha Sep 22 '21 at 03:00
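
A minimal sketch of the cookie approach from RichEdwards' comment, assuming the site's session is cookie-based; login_url and base_url are placeholders, and the credential steps are elided as in the question:

from selenium import webdriver

def login_and_get_cookies(login_url):
    # Log in once in a throwaway driver and capture the session cookies
    driver = webdriver.Chrome()
    driver.get(login_url)
    # ... enter user ID, enter password, click Login ...
    cookies = driver.get_cookies()  # list of cookie dicts for the current domain
    driver.quit()
    return cookies

def new_authenticated_driver(base_url, cookies):
    # Open a fresh driver and replay the stored cookies instead of logging in
    driver = webdriver.Chrome()
    driver.get(base_url)  # must be on the domain before add_cookie()
    for cookie in cookies:
        driver.add_cookie(cookie)
    return driver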

2 Answers


I believe that starting browsers in separate processes and communicating with them via queues is a good approach (and more scalable). A process can easily be killed and respawned if something goes wrong. The pseudo-code might look like this:

# worker.py
def entrypoint(in_queue, out_queue):  # runs in a child process
    crawler = Crawler()
    browser = Browser()  # init, log in, etc.
    while True:
        command = in_queue.get()
        if command is None:  # sentinel: shut down cleanly
            break
        result = crawler.handle(command, browser)
        out_queue.put(result)

# main.py
from multiprocessing import Process, Queue
import worker

in_queue, out_queue = Queue(), Queue()
Process(target=worker.entrypoint, args=(in_queue, out_queue)).start()
for task in tasks:  # tasks: your list of URLs/commands
    in_queue.put(task)
    result = out_queue.get()
in_queue.put(None)  # tell the worker to stop
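
To make the "killed and respawned" part concrete, a rough supervision loop might look like the sketch below; tasks and the 120-second timeout are placeholders, and an orphaned browser process may still need separate cleanup:

# main.py (supervision sketch)
import queue
from multiprocessing import Process, Queue
import worker

in_queue, out_queue = Queue(), Queue()

def spawn():
    p = Process(target=worker.entrypoint, args=(in_queue, out_queue), daemon=True)
    p.start()
    return p

proc, results = spawn(), []
for task in tasks:
    in_queue.put(task)
    try:
        results.append(out_queue.get(timeout=120))
    except queue.Empty:  # worker hung, e.g. the browser froze
        proc.terminate()  # kill the worker process
        proc.join()
        proc = spawn()  # respawn; entrypoint logs in again (the failed task could be re-queued)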
alex_noname

I know it's too late to answer this question, but I'm going to drop a code snippet that does the job for anyone who needs it.

import re
import threading

from selenium import webdriver

drivers_dict = {}

# Create one driver instance per thread so it can be reused across iterations.
def scraping_function(link):
    # Thread names can differ between executor runs ("ThreadPoolExecutor-1_3"
    # vs "ThreadPoolExecutor-0_3"), so normalize them to a stable key.
    thread_name = threading.current_thread().name
    thread_name = re.sub(r"ThreadPoolExecutor-(\d+)_(\d+)", r"ThreadPoolExecutor-0_\2", thread_name)
    try:
        driver = drivers_dict[thread_name]
    except KeyError:
        # PATH and chrome_options are defined elsewhere, as in the original snippet
        driver = drivers_dict[thread_name] = webdriver.Chrome(PATH, options=chrome_options)
    driver.get(link)
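
For completeness, a hedged usage sketch: it assumes PATH and chrome_options are defined as above, links is a placeholder list, and quitting the cached drivers at the end is an extra step not shown in the snippet:

from concurrent.futures import ThreadPoolExecutor

links = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

with ThreadPoolExecutor(max_workers=5) as executor:
    list(executor.map(scraping_function, links))  # iterate to surface exceptions

for driver in drivers_dict.values():  # quit each per-thread driver when done
    driver.quit()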
Ajay Pyatha