The only potential problem I see is that, because you are being efficient and creating a single Selenium driver per thread, you have neglected to "quit" all the drivers once the submitted jobs have completed, and those driver processes, especially when run from an IDE, may very well not terminate. I would make the following changes:
- Add a class `Driver` that will create the driver instance and store it in thread-local storage, but that also has a destructor that will quit the driver when the thread-local storage is deleted:
```python
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')
```
- `create_browser` now becomes:
```python
def create_browser():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver
```
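Because `create_browser` caches the instance in `threading.local`, repeated calls within a thread reuse one object while each thread gets its own. Here is a minimal, self-contained sketch of that behavior using a trivial `FakeDriver` stand-in (an illustrative name, not part of the original code) in place of a real Selenium driver:

```python
import threading

threadLocal = threading.local()

class FakeDriver:
    """Hypothetical stand-in for the real Selenium driver wrapper."""
    pass

def create_browser():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = FakeDriver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver

results = {}

def worker(name):
    first = create_browser()
    second = create_browser()
    # keep the objects alive so their ids stay unique for the checks below
    results[name] = (first, first is second)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# within a thread, the same instance is reused
print(all(reused for _, reused in results.values()))  # True
# across threads, the instances are distinct
print(len({id(d) for d, _ in results.values()}))      # 3
```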
- Finally, after all the `Future` results have been obtained, add the following lines to delete the thread-local storage and force the `Driver` instances' destructors to be called (hopefully):
```python
del threadLocal
import gc
gc.collect()  # a little extra insurance
```
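The whole lifecycle can be exercised without Selenium at all: a stand-in class whose `__del__` increments a counter shows that, once the thread-local storage is deleted (with the `gc.collect()` insurance), the destructors do run. `FakeDriver` and `quit_count` are illustrative names introduced for this sketch:

```python
import gc
import threading
from concurrent.futures import ThreadPoolExecutor

quit_count = 0
count_lock = threading.Lock()

class FakeDriver:
    """Stand-in for the Driver wrapper; __del__ plays the role of quit()."""
    def __del__(self):
        global quit_count
        with count_lock:
            quit_count += 1

threadLocal = threading.local()

def create_browser():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = FakeDriver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver

def job(n):
    create_browser()  # one FakeDriver per worker thread
    return n * n

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(job, range(9)))

del threadLocal
gc.collect()  # a little extra insurance

# every worker thread that ran a job created exactly one FakeDriver,
# and all of them have been finalized by this point
print(1 <= quit_count <= 3)  # True
```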
Update
I should add that I do not have any problems running to completion (I print 'Done' following the call to `gc.collect()`). I do, however, see the following messages logged on my Windows desktop:
```none
[1024/092605.493:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092605.562:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092605.579:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092605.592:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092605.634:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
...
[1024/092617.865:INFO:CONSOLE(118)] "The deviceorientation events are blocked by feature policy. See https://github.com/WICG/feature-policy/blob/master/features.md#sensor-features", source: https://z.moatads.com/chaseusdcm562975626226/moatad.js (118)
[1024/092617.949:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092618.015:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092618.456:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092618.479:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092618.570:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092618.738:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092618.849:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
[1024/092618.928:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source: (0)
```
And this is my output:
ImportXML XPath issue using Google Sheets on a web scraping query
Scrapy meta or cb_kwargs not passing properly between multiple methods
How to seperate a list into table formate using python
How can I extract a table from wikipedia using Beautiful soup
Load a series of payload requests and perform pagination for each one of them
Pandas read_html not reading text properly
Getting text nested text in non-static webpage with httr in R [closed]
Scraping data with duplicate column headers [closed]
I keep getting [ TypeError: 'function' object is not iterable ] every time I try to iterate over the result of my function which returns an iterable [closed]
selnium and beutifulsoup scrapper very inconsistent
Web scraping the required content from a url link in R
Web-scrapping pop-up info generated by hovering over canvas element (Python/Selenium)
Daily leaderboard or price tracking data
Scrape PDF embedded in .php page
Beautiful Soup returning only the last URL of a txt file
Having trouble in scraping table data using beautiful soup
Authentication - Security Window - Rvest R
How can I read an iframe content inside another iframe using Puppeteer?
Xamarin.Forms: is there a way to update the style of web page displayed in a WebView with scraping?
Counter not working in for(i=0; ++i) loop node.js
Python: selenium can't read an specific table
Scraped json data want to output CSV file
Unable to scrape “shopee.com.my” top selling products page
How to click a menu item from mobile based website in selenium Python?
Does selenium in standalone mode has limitation for maximum number of sessions can be present at a time?
Error while capturing full website screen shot
How to retrieve SharePoint webpage code(html) or Scrape a sharepoint webpage?
API web data capture
Selenium select disappearing webelement
Python SQlite Query to select recently added data in the table
Webscraping with varying page numbers
How to extract contents between div tags with rvest and then bind rows
Why is the previous request aborting if I send a new request to the flask server? [closed]
Does anyone know how to click() on an href within data-bind using selenium? [closed]
Web Scraping on login sites with Python
How do I render image, title and link to template from views using one 'for loop'
Regex on List Comprehension Not Producing List But List of Lists Instead [duplicate]
How to get all tr id by using python selenium?
Scrapy - TypeError: can only concatenate str (not “list”) to str
I need to save scraped urls to a csv file in URI format. file won't write to csv
Scrapy keeps giving me the errot AttributeError: 'str' object has no attribute 'text'
How to scrape the different content with the same html attributes and values?
I can not scrape Google news with Beautiful soup. I am getting the error:TypeError: 'NoneType' object is not callable [closed]
selenium while loop error on load more button
Python- Selenium/BeautifulSoup PDF & Table scraper
Crawling all page with scrapy and FormRequest
Web Scrape COVID19 Data from Download Button in R
How do 3rd party app stores know when a new app is added to Google Play?
Scraping hidden leaderboard data from site
Cannot access a table shown in a Tableau Public Dashboard
Update 2
If you are hanging on a result, you might consider using a timeout:
```python
if __name__ == '__main__':
    base = "https://stackoverflow.com{}"
    URL = "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50"
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_title, link): link for link in get_links(URL)}
        for future in future_to_url:
            try:
                print(future.result(30))
            except concurrent.futures.TimeoutError:
                url = future_to_url[future]
                print('TimeoutError for URL', url)
    del threadLocal
    import gc
    gc.collect()  # a little extra insurance
    print('done')
```
Note that I am not using `as_completed` anymore, since I want to be able to specify a timeout value rather than wait indefinitely for a result. Here I specified a value of 30 seconds, which should be far more than enough to allow the thread to initialize the driver and get the first result. If you are in fact hanging on a result, this should print a `TimeoutError` message and continue.
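The timeout pattern above can be demonstrated in isolation with a deliberately slow task; the sleep durations and the half-second timeout below are arbitrary illustration values:

```python
import concurrent.futures
import time

def slow_job(seconds):
    time.sleep(seconds)
    return seconds

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # the 2.0-second job will blow past the 0.5-second timeout
    future_to_arg = {executor.submit(slow_job, s): s for s in (0.1, 2.0)}
    outcomes = []
    for future in future_to_arg:
        try:
            # wait at most half a second for this particular result
            outcomes.append(future.result(timeout=0.5))
        except concurrent.futures.TimeoutError:
            outcomes.append('TimeoutError for arg %s' % future_to_arg[future])

print(outcomes)  # [0.1, 'TimeoutError for arg 2.0']
```

Note that `future.result(timeout=...)` only stops the caller from waiting; the underlying thread keeps running, which is why the `with` block still takes the full two seconds to exit.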