-1

I've written a script using Selenium that runs its requests concurrently with a thread pool (multithreading rather than multiprocessing), taking the idea from this answer. The script works fine and I see all the results in the console. However, when the execution is done, I don't see any indication at the bottom of the IDE that the process has finished.

The following images were taken from Python's default IDE and Sublime Text.

[image: screenshots from the two editors]

import threading
import concurrent.futures
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

threadLocal = threading.local()

def create_browser():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        driver = webdriver.Chrome(options=options)   
        setattr(threadLocal, 'driver', driver)
    return driver

def get_links(link):
    driver = create_browser()
    driver.get(link)
    for elem in WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink"))):
        yield elem.get_attribute("href")

def get_title(url):
    driver = create_browser()
    driver.get(url)
    title = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"h1[itemprop='name'] > a.question-hyperlink"))).text
    return title

if __name__ == '__main__':
    base = "https://stackoverflow.com{}"
    URL = "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50"
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_title, link): link for link in get_links(URL)}
        for item in concurrent.futures.as_completed(future_to_url):
            print(item.result())

How can I make the process terminate once the execution is complete?

MITHU
  • Could you please clarify what exactly is your issue? The script runs just fine and exits with `code 0` - see [this](https://imgur.com/ri3SpE0) – baduker Oct 15 '20 at 17:53
  • In my case when the execution is done I can't see the similar line that you are having, as in `process finished with exit code 0` and for this reason it appears that the script is still running even when it is accomplished. I just wish to see that line to make sure I'm done with it. – MITHU Oct 15 '20 at 19:04
  • Maybe the issue is how you run it? I don't know what OS you're on, but running your script both thru PyCharm and bash gives me the same output. I'm on Linux and all looks fine. Sometimes, no (error) message is a good message. – baduker Oct 15 '20 at 19:26
  • I'm on Win 7, 32 bit. I used python's default IDE and sublime text for the test. – MITHU Oct 15 '20 at 19:42

2 Answers

1

The only potential problem I see is that, in trying to be efficient by creating a single Selenium driver per thread, you never "quit" the drivers once all the submitted jobs have completed. Those driver processes, especially when run from an IDE, may well not terminate. I would make the following changes:

  1. Add a class Driver that creates the driver instance and is stored in thread-local storage, and that also has a destructor that quits the driver when the thread-local storage is deleted:
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        #print('The driver has been "quitted".')
  2. create_browser now becomes:
def create_browser():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver
  3. Finally, after all the Future results have been obtained, add the following lines to delete the thread-local storage and force the Driver instances' destructors to be called (hopefully):
del threadLocal
import gc
gc.collect() # a little extra insurance
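
Since `__del__` only runs when the interpreter actually collects the object, an alternative that does not depend on garbage collection is to record each driver as it is created and quit them all explicitly in a `finally` block. The following is a minimal sketch of that idea, not the approach proposed above. `FakeDriver`, `get_driver`, `quit_all_drivers`, `work` and `run` are hypothetical names, and `FakeDriver` stands in for `webdriver.Chrome` so the example runs without a browser.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome (assumption for this sketch)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


thread_local = threading.local()
all_drivers = []                    # every driver created by any worker thread
all_drivers_lock = threading.Lock()


def get_driver(factory=FakeDriver):
    """Return this thread's driver, creating and registering it on first use."""
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = factory()
        thread_local.driver = driver
        with all_drivers_lock:
            all_drivers.append(driver)
    return driver


def quit_all_drivers():
    """Quit every registered driver and return the list of drivers that were quit."""
    with all_drivers_lock:
        drivers = list(all_drivers)
        all_drivers.clear()
    for driver in drivers:
        driver.quit()
    return drivers


def work(n):
    get_driver()  # a real job would call driver.get(...) here
    return n * 2


def run():
    try:
        with ThreadPoolExecutor(max_workers=3) as executor:
            results = list(executor.map(work, range(10)))
    finally:
        # Runs even if a job raised, so no browser is left behind.
        closed = quit_all_drivers()
    return results, closed
```

With a real browser, passing a factory that builds `webdriver.Chrome(options=options)` to `get_driver` should give the same structure; the cleanup then happens at a known point instead of whenever the objects are collected.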

Update

I should add that I do not have any problems running to completion (I print 'Done' following the call to gc.collect()). I do, however, see the following messages logged on my Windows desktop:

[1024/092605.493:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092605.562:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092605.579:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092605.592:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092605.634:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
...
[1024/092617.865:INFO:CONSOLE(118)] "The deviceorientation events are blocked by feature policy. See https://github.com/WICG/feature-policy/blob/master/features.md#sensor-features", source: https://z.moatads.com/chaseusdcm562975626226/moatad.js (118)
[1024/092617.949:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092618.015:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092618.456:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092618.479:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092618.570:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092618.738:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092618.849:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)
[1024/092618.928:INFO:CONSOLE(0)] "Error with Feature-Policy header: Unrecognized feature: 'speaker'.", source:  (0)

And this is my output:

ImportXML XPath issue using Google Sheets on a web scraping query
Scrapy meta or cb_kwargs not passing properly between multiple methods
How to seperate a list into table formate using python
How can I extract a table from wikipedia using Beautiful soup
Load a series of payload requests and perform pagination for each one of them
Pandas read_html not reading text properly
Getting text nested text in non-static webpage with httr in R [closed]
Scraping data with duplicate column headers [closed]
I keep getting [ TypeError: 'function' object is not iterable ] every time I try to iterate over the result of my function which returns an iterable [closed]
selnium and beutifulsoup scrapper very inconsistent
Web scraping the required content from a url link in R
Web-scrapping pop-up info generated by hovering over canvas element (Python/Selenium)
Daily leaderboard or price tracking data
Scrape PDF embedded in .php page
Beautiful Soup returning only the last URL of a txt file
Having trouble in scraping table data using beautiful soup
Authentication - Security Window - Rvest R
How can I read an iframe content inside another iframe using Puppeteer?
Xamarin.Forms: is there a way to update the style of web page displayed in a WebView with scraping?
Counter not working in for(i=0; ++i) loop node.js
Python: selenium can't read an specific table
Scraped json data want to output CSV file
Unable to scrape “shopee.com.my” top selling products page
How to click a menu item from mobile based website in selenium Python?
Does selenium in standalone mode has limitation for maximum number of sessions can be present at a time?
Error while capturing full website screen shot
How to retrieve SharePoint webpage code(html) or Scrape a sharepoint webpage?
API web data capture
Selenium select disappearing webelement
Python SQlite Query to select recently added data in the table
Webscraping with varying page numbers
How to extract contents between div tags with rvest and then bind rows
Why is the previous request aborting if I send a new request to the flask server? [closed]
Does anyone know how to click() on an href within data-bind using selenium? [closed]
Web Scraping on login sites with Python
How do I render image, title and link to template from views using one 'for loop'
Regex on List Comprehension Not Producing List But List of Lists Instead [duplicate]
How to get all tr id by using python selenium?
Scrapy - TypeError: can only concatenate str (not “list”) to str
I need to save scraped urls to a csv file in URI format. file won't write to csv
Scrapy keeps giving me the errot AttributeError: 'str' object has no attribute 'text'
How to scrape the different content with the same html attributes and values?
I can not scrape Google news with Beautiful soup. I am getting the error:TypeError: 'NoneType' object is not callable [closed]
selenium while loop error on load more button
Python- Selenium/BeautifulSoup PDF & Table scraper
Crawling all page with scrapy and FormRequest
Web Scrape COVID19 Data from Download Button in R
How do 3rd party app stores know when a new app is added to Google Play?
Scraping hidden leaderboard data from site
Cannot access a table shown in a Tableau Public Dashboard

Update 2

If you are hanging on a result, you might consider using a timeout:

if __name__ == '__main__':
    base = "https://stackoverflow.com{}"
    URL = "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50"
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_title, link): link for link in get_links(URL)}
        for future in future_to_url:
            try:
                print(future.result(30))
            except concurrent.futures.TimeoutError as e:
                url = future_to_url[future]
                print('TimeoutError for URL', url)
    del threadLocal
    import gc
    gc.collect() # a little extra insurance
    print('done')

Note that I am no longer using as_completed, since I want to specify a timeout for each result rather than wait for it indefinitely. Here I specified 30 seconds, which should be far more than enough for the thread to initialize the driver and get the first result. If you are in fact hanging on a result, this should print a TimeoutError message and continue.
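
The timeout behaviour can be tried without a browser. The sketch below is only an illustration under assumptions: `slow_task` and `collect_with_timeout` are hypothetical names, and `time.sleep` stands in for a page load that may hang.

```python
import concurrent.futures
import time


def slow_task(delay):
    """Hypothetical job: sleeps to imitate a slow or hanging page load."""
    time.sleep(delay)
    return delay


def collect_with_timeout(delays, timeout):
    """Wait at most `timeout` seconds for each result, in submission order."""
    outcomes = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        futures = {executor.submit(slow_task, d): d for d in delays}
        for future, delay in futures.items():
            try:
                outcomes[delay] = future.result(timeout=timeout)
            except concurrent.futures.TimeoutError:
                outcomes[delay] = "timeout"
    return outcomes
```

One caveat: `result(timeout=...)` only stops the caller from waiting. It does not cancel a job that is already running, and leaving the `with` block still waits for that job to finish, so a driver call that never returns would also need its own limit (for example a page-load timeout on the driver).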

Booboo
  • Using this line `del threadLocal` outside `with` block within `main` function in my script seems to have fixed the issue. Thanks for the solution @Booboo. – MITHU Oct 25 '20 at 03:49
  • Did you also use the code that ensures that a call to `driver.quit()` is done, i.e. the `Driver` class I had proposed? – Booboo Oct 25 '20 at 11:04
0

It seems like the process terminated fine, but if you want to make sure it has terminated, import sys and call sys.exit() at the end of the script.
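
A minimal sketch of what that could look like, with `main` and `run_and_exit` as hypothetical names. `sys.exit` works by raising `SystemExit` in the calling thread, so the interpreter still waits for non-daemon threads at shutdown, and the call does not by itself quit browser processes that were never closed.

```python
import sys


def main():
    # Placeholder for the real work (the thread pool and the driver cleanup).
    print("done")
    return 0  # conventional "success" exit status


def run_and_exit():
    # Raises SystemExit carrying the status returned by main().
    sys.exit(main())
```

In the script, `run_and_exit()` would be called under the `if __name__ == '__main__':` guard, after the `with` block has finished.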

  • Although I should have used your suggested line outside with block, I used outside for loop like ***[this](https://imgur.com/zVrUH2T)*** but the script doesn't seem to reach that line and as a result it still gets stuck when the execution is done. Thanks. – MITHU Oct 22 '20 at 18:13