I have several heavy Python functions running on my cluster that perform application-critical and management tasks.

I am noticing that Python threads get stuck unpredictably.

I even have a service running 10+ threads, where a specific thread gets stuck while others are still actively running their jobs.

Most of these threads run `while True` loops.

What is a good way to write reliable threads in Python, with a mechanism to self-recover if a thread gets stuck?

Meet Shah
  • The good way is to start learning about **deadlocks** in multithreading. Threads getting stuck is a symptom of bad code that makes the threads wait for each other's resources. – Timmy Chan Jul 18 '23 at 23:33
  • Also, without a [mcve] (or any code at all), this is off-topic since there is nothing to fix. – Mark Tolonen Jul 18 '23 at 23:38
  • [This post](https://stackoverflow.com/questions/34512/what-is-a-deadlock) explains deadlocks quite well. – AbeMonk Jul 18 '23 at 23:39
  • Very hard to reproduce, happens randomly and very infrequently... – Meet Shah Jul 19 '23 at 00:04
  • Is it worthwhile to add a condition that terminates the while loop after x iterations (for example, ending with a signal, flag, or email)? – Adrian Ang Jul 19 '23 at 02:25
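
The bounded-loop idea from the last comment could look roughly like the sketch below. The names `worker`, `do_work`, `max_iterations`, and `poll_interval` are hypothetical and not taken from the question. Note the limitation: the flag is only checked between iterations, so this does not rescue a thread that is blocked inside `do_work` itself.

```python
import threading


def worker(stop_event, do_work, max_iterations=1000, poll_interval=0.01):
    """Run do_work repeatedly until stopped or the iteration budget is spent.

    Returns the number of iterations that actually ran.
    """
    iterations = 0
    # Replaces a bare `while True`: both conditions give the loop a way out.
    while not stop_event.is_set() and iterations < max_iterations:
        do_work()
        iterations += 1
        # Sleep between iterations, waking early if the stop flag is set.
        stop_event.wait(poll_interval)
    return iterations
```

Another thread can call `stop_event.set()` to end the loop, and whoever started the worker can inspect the return value to decide whether to restart it.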

1 Answer

The good way is to start learning about deadlocks in multithreading. Threads getting stuck is a symptom of bad code that makes the threads wait for each other's resources.
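
As a rough illustration of one way to keep a thread from waiting forever on another thread's resource, the sketch below acquires two locks with a timeout and backs off if either is unavailable. The function `run_with_both_locks` and the locks `lock_a` and `lock_b` are hypothetical, since the question includes no code; `Lock.acquire(timeout=...)` is the standard `threading` API.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()


def run_with_both_locks(first, second, action, timeout=1.0):
    """Run action() while holding both locks.

    Returns True if the action ran, or False if either lock could not be
    acquired within the timeout, so the caller can retry or report it.
    """
    if not first.acquire(timeout=timeout):
        return False
    try:
        if not second.acquire(timeout=timeout):
            return False
        try:
            action()
            return True
        finally:
            second.release()
    finally:
        # Always release the first lock, even when the second one timed out.
        first.release()
```

A timeout only turns a silent hang into a visible failure. Acquiring locks in the same order in every thread is what removes this class of deadlock in the first place.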

If the deadlock cannot be removed, retry the operation a fixed number of times and then abort. You then need to make sure that aborting does not lose data: perform the operations transactionally and save the aborted jobs for later. This is especially important for database calls and inter-process messages. A common way to handle it is an external message queue that moves the messages of failed operations to a dead letter queue (DLQ).
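
A minimal in-process sketch of the "retry x times, then park the job" pattern follows. It uses the standard library `queue.Queue` as a stand-in for an external dead letter queue, and the names `process_with_retry`, `handler`, and `dead_letters` are hypothetical. A real deployment would also need the transactional handling described above, which this sketch does not cover.

```python
import queue


def process_with_retry(job, handler, dead_letters, max_attempts=3):
    """Call handler(job) up to max_attempts times.

    Returns the handler's result on success. If every attempt raises,
    the job and the last error are put on dead_letters and None is returned.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return handler(job)
        except Exception as exc:  # broad on purpose: any failure counts as an attempt
            last_error = exc
    # Keep the failed job instead of dropping it, so it can be replayed later.
    dead_letters.put((job, repr(last_error)))
    return None
```

A separate consumer can drain `dead_letters` later to replay or inspect the failed jobs.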

Timmy Chan
  • Do you have any recommendations for message queue libraries? – Meet Shah Jul 19 '23 at 00:31
  • I would recommend ActiveMQ. It is standalone, so you will probably want to deploy it to Kubernetes, connect each of the Python threads to it, and have them communicate through it. – Timmy Chan Jul 19 '23 at 00:55
  • I need something simpler; it seems ActiveMQ will add overhead. – Meet Shah Jul 19 '23 at 01:38
  • The simplest (and cheapest) way is to solve the deadlocks as mentioned earlier. You could post some code so we can find out why it is deadlocking. But that would change your original question, which is "asking for ways to write reliable threads". – Timmy Chan Jul 19 '23 at 02:16