See my comment about how you might modify your code to append to an Excel file or CSV file. The following code takes your current code (but would be equally useful once your code is modified to append as you go along) and changes it to use a thread pool for retrieving URLs. Not only will this increase performance, but if the retrieval of one URL hangs, you still have N - 1 threads left for processing (assuming they don't start hanging, too).
The tricky bit, which also greatly increases performance, is to start up a Selenium driver only once per thread and not once per URL request. A reference to each thread's driver is kept in thread-local storage. That reference to thread-local storage is ultimately removed (and, for good measure, garbage collection is run) to force the destructor call on the Driver instances, which ensures that driver.quit() is called for each driver instance.
Again, this code can be modified to append the results returned from each call to process_url to an existing .xlsx or .csv file, rather than accumulating them in records, building a single large dataframe, and creating the spreadsheet in one shot. The CSV version is shown below; a sketch of the Excel variant follows it.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from pandas import ExcelWriter
import threading
import gc

threadLocal = threading.local()

class Driver:
    def __init__(self):
        # example of how you might be creating your driver:
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up

def create_driver():
    # one Driver instance per thread, created on first use and reused afterwards
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver

def process_url(url):
    driver = create_driver()
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    results = soup.find_all('div', {'id': 'dp'})
    records = []
    for item in results:
        # extract_record (like final, the list of URLs, below) comes from your existing code
        record = extract_record(item)
        if record:
            records.append(record)
    return records

N_THREADS = 20  # play around with this number
TIMEOUT = 10  # maximum time for process_url to complete

records = []
with ThreadPoolExecutor(N_THREADS) as executor:
    futures = [executor.submit(process_url, url) for url in final]
    for future in futures:
        try:
            new_records = future.result(timeout=TIMEOUT)
            records.extend(new_records)
        except TimeoutError:
            print('TimeoutError: this thread may be permanently hung.')

# clean up drivers
threadLocal = None  # drop the reference so the Driver destructors run
gc.collect()

df = pd.DataFrame(records)
df.columns = ['item_name', 'price_Final', 'Final_bullet', 'prod_desc', 'Image_link', 'Stock', 'Prod_spec ', 'Quantity_avail', 'Prod_spec', 'Brand_info', 'Add_info', 'Delivered_by', 'BTG']
writer = ExcelWriter('Desktop/Data Scrap.xlsx')
df.to_excel(writer, 'Sheet1', index=False)
writer.save()
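Note that the timeout passed to future.result only stops the main thread from waiting; it does not unstick a worker whose driver.get call has hung. If you want such hung page loads to eventually raise and free the thread, one option (just a sketch, and the 60-second value is an arbitrary assumption) is to set a page-load timeout when the driver is created:

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)
        # assumed value: pages taking longer than 60 seconds make driver.get raise
        # selenium.common.exceptions.TimeoutException instead of hanging forever
        self.driver.set_page_load_timeout(60)

With this, an exception raised inside process_url is re-raised by future.result in the main thread, so you would want to catch selenium's TimeoutException there alongside TimeoutError.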
To Append to a CSV File As You Go Along
Change the loop that processes the Future instances to the following (this requires import csv at the top of the script):
for i, future in enumerate(futures):
    try:
        new_records = future.result(timeout=TIMEOUT)
        # 'w' on the first iteration creates the file and writes the header row;
        # 'a' on later iterations appends to it
        with open('Desktop/Data Scrap.csv', 'w' if i == 0 else 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            if i == 0:
                csvwriter.writerow(['item_name', 'price_Final', 'Final_bullet', 'prod_desc', 'Image_link', 'Stock', 'Prod_spec ', 'Quantity_avail', 'Prod_spec', 'Brand_info', 'Add_info', 'Delivered_by', 'BTG'])
            for record in new_records:
                csvwriter.writerow(record)
    except TimeoutError:
        print('TimeoutError: this thread may be permanently hung.')
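For completeness, here is a sketch of the Excel variant mentioned above, since a .xlsx file cannot be appended to the way a plain-text CSV can. It loads the workbook if it already exists (or creates it with a header row on the first batch), appends the new rows, and saves it again, using openpyxl directly. The helper name append_records is my own; the file name and column names are the ones from your code, and each record is assumed to be a list or tuple of values as in the CSV example:

import os
from openpyxl import Workbook, load_workbook

HEADER = ['item_name', 'price_Final', 'Final_bullet', 'prod_desc', 'Image_link', 'Stock', 'Prod_spec ',
          'Quantity_avail', 'Prod_spec', 'Brand_info', 'Add_info', 'Delivered_by', 'BTG']
PATH = 'Desktop/Data Scrap.xlsx'

def append_records(new_records):
    # open the existing workbook, or create it with a header row on the first call
    if os.path.exists(PATH):
        wb = load_workbook(PATH)
        ws = wb.active
    else:
        wb = Workbook()
        ws = wb.active
        ws.append(HEADER)
    for record in new_records:
        ws.append(list(record))
    wb.save(PATH)

for future in futures:
    try:
        append_records(future.result(timeout=TIMEOUT))
    except TimeoutError:
        print('TimeoutError: this thread may be permanently hung.')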