
I am using ThreadPoolExecutor from Python's concurrent.futures to parallelize scraping and writing the results to a database. While doing so, I realized that I do not get any information when one of the threads fails. How can I find out which threads fail and why (i.e. with the 'normal' traceback)? Below is a minimal working example.

import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(format='%(asctime)s  %(message)s',
                    datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

def worker_bee(seed):
    # sido is not defined intentionally to break the code
    result = seed + sido
    return result

# uncomment next line, and you will get the usual traceback
# worker_bee(1)

# ThreadPoolExecutor will not provide any traceback
logging.info('submitting all jobs to the queue')
with ThreadPoolExecutor(max_workers=4) as executor:
    for seed in range(0,10):
        executor.submit(worker_bee, seed)
    logging.info('submitted, waiting for threads to finish')

If I import logging inside worker_bee() and direct messages to the root logger, I can see them in the final log, but only the log messages that I define myself, not the traceback of where the code actually fails.
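
For illustration, this is roughly what that workaround looks like (a minimal sketch; the log messages are ones I define myself):

def worker_bee(seed):
    logging.info(f'worker {seed} started')
    # sido is still not defined, so this line raises NameError,
    # but the executor swallows the exception silently
    result = seed + sido
    logging.info(f'worker {seed} finished')  # never reached
    return result

The 'started' messages show up in the log, the 'finished' messages never do, and the NameError itself appears nowhere.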


1 Answer

You can get a "normal traceback" by saving the Future objects that executor.submit() returns and later calling their result() method. result() waits for the thread to finish and, if the worker function raised an exception, re-raises it in the calling thread with the full traceback.

Here's what I mean:

from concurrent.futures import ThreadPoolExecutor
import logging

logging.basicConfig(format='%(asctime)s  %(message)s',
                    datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

def worker_bee(seed):
    # sido is not defined intentionally to break the code
    result = seed + sido
    return result

logging.info('submitting all jobs to the queue')
with ThreadPoolExecutor(max_workers=4) as executor:
    results = []
    for seed in range(10):
        # submit() returns a Future; save it so its result (or exception)
        # can be retrieved later
        result = executor.submit(worker_bee, seed)
        results.append(result)
    logging.info('submitted, waiting for threads to finish')

for result in results:
    # result() re-raises the worker's exception here in the main thread
    print(result.result())

Output:

20-03-21 16:21:24  submitting all jobs to the queue
20-03-21 16:21:24  submitted, waiting for threads to finish
Traceback (most recent call last):
  File "logging-threads-when-using-threadpoolexecutor.py", line 24, in <module>
    print(result.result())
  File "C:\Python3\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Python3\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
  File "C:\Python3\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "logging-threads-when-using-threadpoolexecutor.py", line 12, in worker_bee
    result = seed + sido
NameError: name 'sido' is not defined
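
If you'd rather see every failure instead of stopping at the first one, catch the exception for each future individually. Here's a sketch using concurrent.futures.as_completed, which yields futures in completion order, reusing worker_bee from above:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=4) as executor:
    # map each Future back to the seed it was submitted with
    futures = {executor.submit(worker_bee, seed): seed for seed in range(10)}
    for future in as_completed(futures):
        seed = futures[future]
        try:
            print(future.result())
        except Exception as exc:
            # the loop keeps going, so every failed job gets reported
            print(f'seed {seed} failed: {exc!r}')

With the broken worker_bee above, this prints ten "failed" lines instead of aborting on the first NameError.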

  • I think `print(result.result())` needs to be indented under the `with` context as it's only defined there – kareem_emad Mar 21 '20 at 23:10
  • @Kareem: No, that's not necessary because a `with` statement does not introduce a new scope. But your comment did make me notice something else... – martineau Mar 21 '20 at 23:12
  • @martineau: Thanks, that's at least something for the development stage! Since the Python docs mention that [Future instances](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future) 'should not be created directly except for testing', I am a bit shy to use that in a production environment. I have already seen behavior that is undesirable in production, namely exceptions showing up neither in the standard logger nor on the command line. Another issue I could imagine is memory consumption when submitting millions of jobs. – gochristoph Mar 22 '20 at 00:32
  • I think you're confused. `executor.submit()` returns a "results" object, not a future instance, and it's a common idiom to save that object and later call its `result()` method to actually get the value the thread returned. Also note that if you want your threads to call the logger you will need to create a `Lock` for it to prevent multiple threads from using it at the same time. – martineau Mar 22 '20 at 00:50
  • @martineau: Thanks for clarifying. Reading the docs a second time, nothing speaks against saving the future objects to a list when created via `executor.submit()` (it returns a future object per definition). One remark: the print statement is obsolete, the exception will already be raised by calling the `result()` method. When specifying `filename='mylog.log'` inside `logging.basicConfig()`, the exception will still only show up in the command line, not in mylog.log. Do you have an idea on how to channel that into the log file, or what's the mechanism behind it? – gochristoph Mar 22 '20 at 17:21
  • The `print()` call was just an excuse to call the `result()` method, not really important or meaningful otherwise. Sorry, I don't know of a way to automatically have all exceptions sent to the log file — but I'm no logging expert. – martineau Mar 22 '20 at 18:32
  • See the answer to the question [How do I log a Python error with debug information?](https://stackoverflow.com/questions/5191830/how-do-i-log-a-python-error-with-debug-information), which shows the only way I know to log exceptions. It requires adding exception handlers and calling `logging.exception()`, so it's hardly "automatic" (a sketch of that pattern follows below the comments). – martineau Mar 22 '20 at 18:51
  • This [answer](https://stackoverflow.com/a/6234491/355230) to a related question looks like a way to automatically log unhandled exceptions. – martineau Mar 22 '20 at 19:03
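
Following up on the file-logging question in the comments: catching the exception around result() and calling logging.exception() sends the message plus the full traceback through whatever handlers are configured, so it lands in the log file as well. A minimal sketch (mylog.log is the filename from the comment above):

import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(filename='mylog.log', format='%(asctime)s  %(message)s',
                    datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

def worker_bee(seed):
    # sido is intentionally undefined, as in the question
    return seed + sido

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(worker_bee, seed) for seed in range(10)]

for future in futures:
    try:
        future.result()
    except Exception:
        # logging.exception() records the message at ERROR level together
        # with the full traceback, so it appears in mylog.log
        logging.exception('a worker failed')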