
I have a Python 3.8+ program using Django and PostgreSQL which requires multiple threads or processes. I can't use threads, since the GIL restricts them to running one at a time, which results in awful performance (especially since most of the threads are CPU-bound).

So the obvious solution was to use the multiprocessing module. But I've encountered several problems:

  1. When using spawn to start new processes, I get the "Apps aren't loaded yet" error when the new process imports the Django models. This is because the new process doesn't have the Django setup and database connection that the main process gets from `python manage.py runserver`. I circumvented it by using fork instead of spawn (as advised here), so the connections are copied to the other processes, but I feel like this is not the best solution and there should be a clean way to start new processes with the necessary connections.

  2. When several of the processes access the database simultaneously, incorrect results are sometimes returned (some even from the wrong models/relations), which crashes the program. This can happen during the initial startup when fetching data, but also while the program is running. I tried to use ISOLATION LEVEL SERIALIZABLE (as advised here) by adding it to the OPTIONS in the database settings, but that didn't work.
    A possible solution might be custom locks given to every process, but that doesn't feel like a good solution either.
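For context, the isolation-level setting mentioned above would look roughly like this in `settings.py` (a sketch; the database name is a placeholder, and this assumes the psycopg2 driver):

```python
# settings.py (sketch)
import psycopg2.extensions

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",  # placeholder
        "OPTIONS": {
            # Open every connection at the SERIALIZABLE isolation level
            "isolation_level": psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE,
        },
    }
}
```

Note that in PostgreSQL, SERIALIZABLE doesn't silently prevent conflicts: conflicting transactions fail with serialization errors that the application must catch and retry, so it changes which errors occur rather than making races disappear.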

So in general, the question is: is there a good, clean way to use multiprocessing in Django without these issues? A way for new processes to get database connections without relying on fork, and for all processes to access the database without race conditions that sometimes produce false results like this?

One important thing: I don't use a Pool, since the processes aren't all running the same simple task. Each process runs a different, specific task; they share data via multiprocessing Events, Queues, Values and Namespaces (shared memory), and new processes can be triggered by user interaction (websockets).
I've looked into Celery, since it's recommended in a lot of questions about Django and multiprocessing, but I don't know how something like that would fit a project structure where specific, different processes need to be created at specific points and data is transferred over the Queues, Events, Values and Namespaces in the existing project.

Thank you for reading; any help is appreciated!

Korne127
  • I'm not sure there's a simple solution. It sounds like the Django/database-specific parts of your multiprocessing system (if it really is that complex) should be separated out. – AKX Aug 29 '22 at 13:55
    `python manage.py runserver` should never be used in production, it's a minimal web server intended for local development. You should use a proper web server like gunicorn that can handle spawning as many processes as you like – Iain Shelvington Aug 29 '22 at 13:59
  • @IainShelvington It'll still be mightily inconvenient to not be able to use `runserver` in local development (if the implication here is "just use another WSGI server"). – AKX Aug 29 '22 at 14:03
  • @AKX Thanks for the response; what do you mean by separated out? The "emergency solution" would be to have all database access in the main process, but this wouldn't be the best solution, since the other processes need to access the database as well. They'd need to get the data via inter-process communication, with threads in the main process just waiting for events to fetch and transfer the needed data (hopefully, at least different threads can access the database; if not, I would have no idea how to solve this). Did you mean this by separating out, or what would you have in mind? – Korne127 Aug 31 '22 at 17:06
  • @IainShelvington Currently, we are just in local development. Would using gunicorn or a different web server solve the problems I described? If yes, can you explain how the different processes can simultaneously access the database? – Korne127 Aug 31 '22 at 17:08
  • Currently, I've had another idea: with every new process, a setup function calling `django.setup()` is called first, before executing the real function. My hope was that this way, every process would create an independent connection to the database so that the current system could work. However, it still throws errors like `django.db.utils.OperationalError: lost synchronization with server: got message type "1", length 976434746` on setup, and I currently have no idea how to prevent them. – Korne127 Aug 31 '22 at 17:11

1 Answer


With every new process, a setup function calling django.setup() is first called before executing the real function. My hope was that this way, every process would create an independent connection to the database so that the current system could work.

Yes - you can do that with an initializer, as explained in my other answer from yesteryear.

However, it still throws errors like django.db.utils.OperationalError: lost synchronization with server: got message type "1", length 976434746

That means you're using the fork start method for subprocesses, so any database connections and their state have been forked into the subprocesses too, and they will get out of sync when used by multiple processes.

You'll need to close them:

import django
from concurrent.futures import ProcessPoolExecutor

def subprocess_setup():
    django.setup()
    # Close connections inherited from the parent process via fork;
    # each worker then opens its own fresh connection on first use
    from django.db import connections
    for conn in connections.all():
        conn.close()

with ProcessPoolExecutor(max_workers=5, initializer=subprocess_setup) as executor:
    ...  # submit tasks to the pool here
AKX
  • First of all, thanks very much for the help. I really appreciate it. I've spent the last days working on it and trying to get the multiprocessing to work and while there have still been problems, your answer has really helped me along to get a working solution, so thank you! – Korne127 Sep 03 '22 at 23:16
  • I've actually initially tried something pretty similar to this by calling an init_function for new processes `def _init_django(target, args):` `import django` `django.setup()` `target(*args)` (imagine new lines between the code segments). However, I've missed that the default was still set to fork, so thanks for adding that. And thanks very much for explaining how to close the connections so it can work both ways! – Korne127 Sep 03 '22 at 23:16
  • While I managed to get it to work both ways (Process & ProcessPoolExecutor), the Django ready function needs to be empty. Previously, the ready function had started the initial process, but since it's called for every new process in `django.setup()`, this is not possible. I could run my code via `python manage.py shell` and then start another method for the initial process manually, but this doesn't create a web server and therefore doesn't work with the frontend. What is the real way to solve this? How do you usually start the code for the main process of the webserver, if not in `ready()`? – Korne127 Sep 03 '22 at 23:18
  • Also, I initially still got an error (Apps aren’t loaded yet) both ways since the python module with the called function imports database models. Apparently the module level imports get scanned while transferring target & args to the new process (and before running the new process where `django.setup()` is called). With ProcessPoolExecutor, it’s only the imports of the module where the function is, with my _init_django way, it is also the imports of all objects (plus referenced objects) passed as parameters. – Korne127 Sep 03 '22 at 23:30
  • The obvious working solution was to move the imports to class / function level, but this isn’t a beautiful solution and with the parameters, it can be many files where you need to do this. So I wanted to ask if you know a better way that these imports don’t get "looked at" before `django.setup()` is called. – Korne127 Sep 03 '22 at 23:30
  • Update: I experimented around more and found pretty nice solutions to the import problem. I now have helper functions either using a Pool or a Process with the _init_django like I described before, but pickling the arguments before starting the process and unpickling after initialising Django. – Korne127 Sep 04 '22 at 13:28
  • Both don’t have the import problems anymore (although both have the disadvantage that multiprocessing.Queue() / .Event()s cannot be passed anymore; instead I need to use multiprocessing.Manager().Queue() / .Event()s. Nevertheless, it now completely works with these methods and I’m glad I’ve found a good solution not needing to remove all database imports in module level. The only remaining problem is the ready function: I could test the code with a hacky solution but I'd appreciate help on how to properly do this and start the code for the main process if not in ready(). – Korne127 Sep 04 '22 at 13:28