
Is it possible to gracefully kill a joblib process (threading backend), and still return the results computed so far?

from joblib import Parallel, delayed

parallel = Parallel(n_jobs=4, backend="threading")
result = parallel(delayed(dummy_f)(x) for x in range(100))

For the moment I have come up with two solutions:

  • parallel._aborted = True, which waits for the already-started jobs to finish (in my case this can take very long)
  • parallel._terminate_backend(), which hangs if jobs are still in the pipeline (parallel._jobs not empty)

Is there a way to work around the library to do this?

sknat

1 Answer


As far as I know, joblib does not provide a method to kill spawned threads. Since each child thread runs in its own context, graceful killing or termination is genuinely difficult. That said, there is a workaround you could adopt.

Mimic threading's .join() functionality (roughly):

  1. Create a shared dictionary shared_dict whose keys correspond to each thread id and whose values will hold either the thread's output or the raised Exception, e.g.:

    shared_dict = {i: None for i in range(num_workers)}

  2. Whenever an error is raised in any thread, catch the exception in a handler and, instead of re-raising it immediately, store it in the shared dictionary

  3. Create an exception handler which waits until all(shared_dict.values()) is truthy

  4. After all values are filled with either a result or an error, exit the program by raising the error, logging it, or whatever is appropriate.
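The steps above can be sketched as follows. This is a minimal illustration, not joblib API: it uses plain threading.Thread so it is self-contained (joblib's threading backend shares memory the same way, so the same dict pattern applies there), and it joins the threads rather than polling all(shared_dict.values()), since that check would misfire on a legitimate falsy result such as 0. The names dummy_f, worker, and num_workers are illustrative.

    import threading

    num_workers = 4
    # Step 1: one slot per thread, holding either a result or an Exception
    shared_dict = {i: None for i in range(num_workers)}

    def dummy_f(x):
        if x == 2:
            raise ValueError(f"worker {x} failed")
        return x * x

    def worker(i):
        # Step 2: catch the exception and store it instead of raising immediately
        try:
            shared_dict[i] = dummy_f(i)
        except Exception as e:
            shared_dict[i] = e

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()

    # Steps 3-4: wait until every slot is filled, then separate results from errors
    for t in threads:
        t.join()

    errors = [v for v in shared_dict.values() if isinstance(v, Exception)]
    results = {k: v for k, v in shared_dict.items() if not isinstance(v, Exception)}
    print(results)  # partial results survive even though one worker failed
    if errors:
        print(f"{len(errors)} worker(s) failed: {errors[0]}")

The point is that no worker's failure aborts the batch early: every slot gets filled first, and only then do you decide whether to raise, log, or return the partial results.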

  • Hi, thanks for your reply! Right, afaik joblib already behaves somewhat like this, storing results in a shared map and returning on completion. The issue I had was wanting to stop the spawned threads before they reach completion. The API provided by joblib doesn't (didn't?) allow this without hacking private functions. I ended up hacking through it to make it work, but I still need to spend some time documenting it. – sknat Aug 09 '21 at 08:17
  • Hi, you are right, it is supposed to work that way, but in a uWSGI app deployment environment joblib can mess things up. I had this issue because of multiple error handlers stacked on top of each other. In my case, when any worker raised an error, it raised an endpoint error; but since the error was raised before joblib could wait for the others to complete, it just paused the remaining workers and resumed them when the next request came, which threw weird runtime errors because of cleanups done in the previous request. – Rishav Dutta Aug 10 '21 at 09:07