I have a use case for which I'm unable to work out the logic, so I'm floating it here for recommendations from experts.
Quick context:
I have a list of 2,500 URLs. I am able to scrape them sequentially using Python and Selenium.
The run time for 1,000 URLs is approximately 1.5 hours.
What I am trying to achieve:
I am trying to optimize the run time through parallel execution. I have reviewed various posts on Stack Overflow, but somehow I am unable to find the missing pieces of the puzzle.
Details
I need to reuse the drivers instead of closing and reopening them for every URL. I came across a post, Python selenium multiprocessing, that leverages threading.local(). Somehow, the number of drivers that get opened exceeds the number of threads specified when I rerun the same code.
Please note that the website requires the user to log in with a user name and password. My objective is to launch the drivers (say 5 drivers) the first time and log in, then keep reusing the same drivers for all subsequent URLs without having to close them and log in again. My understanding of the threading.local() pattern is sketched below.
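For context, this is roughly how I understand the pattern from that post (a sketch only; login() stands in for my actual credential-filling steps, and it assumes chromedriver is discoverable by Selenium):

import threading

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

thread_local = threading.local()

def login(driver):
    # Placeholder for my actual login steps:
    # enter user ID, enter password, click the Login button
    pass

def get_driver():
    # Reuse the driver already attached to the current thread, if any
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        # First call on this thread: open a driver and log in once
        driver = webdriver.Chrome(options=Options())
        login(driver)
        thread_local.driver = driver
    return driver

If I understand correctly, with a pool of 5 threads this should create at most 5 drivers and log in 5 times, after which each scraping call can reuse its thread's driver and go straight to driver.get(url). In practice, I can't get this behavior consistently.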
Also, I am new to Selenium web scraping; I'm just getting familiar with the basics, and multi-threading is uncharted territory. I would really appreciate your help here.
Sharing my code snippet below:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
import threading
from multiprocessing.dummy import Pool as ThreadPool

threadLocal = threading.local()

# Function to open web driver
def get_driver():
    options = Options()
    # NOTE: threadLocal is declared above but never consulted here,
    # so every call opens a brand-new driver
    driver = webdriver.Chrome(<Location to chrome driver>, options=options)
    return driver

# Function to login to website & scrape from website
def parse_url(url):
    driver = get_driver()
    login_url = "https://..."
    driver.get(login_url)
    # Enter user ID
    # Enter password
    # Click on Login button
    # Open web page of interest & scrape
    driver.get(url)
    htmltext = driver.page_source
    htmltext1 = htmltext[0:100]  # keep only the first 100 characters
    return [url, htmltext1]

# Function for multi-threading
def main():
    urls = ["url1",
            "url2",
            "url3",
            "url4"]
    pool = ThreadPool(2)
    records = pool.map(parse_url, urls)
    pool.close()
    pool.join()
    return records

if __name__ == "__main__":
    result = pd.DataFrame(columns=["url", "html_text"], data=main())
How can I modify the above code so that:
- I end up reusing my drivers, and
- I log in to the website only once per driver and scrape multiple URLs in parallel?
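One related detail I'm unsure about: if each thread keeps its driver alive, the drivers are never quit, which I suspect is why the driver count keeps growing across reruns. My current idea is to track every driver that gets created and quit them all at the end, roughly like this (all_drivers and quit_all_drivers() are my own names, not from the linked post):

import threading

all_drivers = []
drivers_lock = threading.Lock()

# In get_driver(), right after a new driver is created:
#     with drivers_lock:
#         all_drivers.append(driver)

def quit_all_drivers():
    # Quit every driver that any worker thread created,
    # so no browser processes are left behind between reruns
    with drivers_lock:
        for d in all_drivers:
            d.quit()
        all_drivers.clear()

I imagine quit_all_drivers() would be called in main() after pool.join(), but I haven't verified whether this is the right approach.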