
I'm wondering whether there's a native implementation in the multiprocessing module that would allow me to store running processes in a list-based structure, where a process is automatically removed from the list whenever it finishes execution.

In code, it'd look like this:

from multiprocessing import Process

pool = []  # This data structure needs to prune non-running processes

class A(Process):
    def run(self):
        pass

for i in range(10):
    worker = A()
    worker.start()  # start() returns None, so keep the reference before starting
    pool.append(worker)


# So if I want to iterate the pool now, it should only contain the alive processes

Another way to manage this would be to keep a dictionary:

pool = {
    processId: processObject
}

And then get the active process ids using psutil:

import psutil

current_process = psutil.Process()
children = current_process.children(recursive=False)
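A minimal sketch of that pruning step, assuming the dictionary is keyed by pid (the names here are illustrative, not from an existing API):

alive_pids = {c.pid for c in children}  # pids of children that are still alive
pool = {pid: proc for pid, proc in pool.items() if pid in alive_pids}  # keep only live entries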

However, what'd be the size of the object inside the dictionary once the process dies?

Kristijan

1 Answer


I don't think such a hypothetical self-updating structure would be a good idea, for the same reason you shouldn't modify a list while you're iterating over it. Processes might get removed while you are iterating over the pool.
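To see the hazard with a plain list standing in for such a pool (strings stand in for process objects):

procs = ['p1', 'p2', 'p3', 'p4']
for p in procs:
    procs.remove(p)  # what a self-pruning pool might do behind your back

print(procs)  # ['p2', 'p4'] -- half the items were silently skipped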

To iterate over it safely, you would need a snapshot, and that would render the whole effort of such a structure pointless. When you need to update your pool-list, you'd better do so explicitly, for example:

pool[:] = [p for p in pool if p.is_alive()]  # p are your process objects

or, if you want all active child processes process-wide and not just those in your custom pool:

multiprocessing.active_children()  # already returns a list of alive child processes

You can of course put that in a function or method and call it whenever you need an up-to-date pool-list. Processes have a pid attribute, so you wouldn't need psutil just to get process ids.
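A minimal sketch of such a helper, assuming the pool is a plain list of Process objects (the function name is illustrative):

import multiprocessing

def alive_pool(pool):
    # prune dead processes in place and hand the list back
    pool[:] = [p for p in pool if p.is_alive()]
    return pool

pids = [p.pid for p in multiprocessing.active_children()]  # pids without psutil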

Darkonaut
  • It doesn't have to be a list - it could be a dictionary. The thing is, this is a web application, and I wish to avoid updating the list on every request, as with concurrency it could get messy: I'd have to block access during the linear search. And the number of worker processes grows linearly with the number of users. – Kristijan Nov 09 '18 at 09:48
  • @Kristijan Who needs the pool data and for what? – Darkonaut Nov 09 '18 at 14:44
  • I want to expose the alive processes' progress through an API endpoint. – Kristijan Nov 12 '18 at 07:33
  • @Kristijan The problem is not building such a structure, but orchestrating consistent read access. It makes no difference with a dictionary if you only want alive processes contained: it would always have to be updated; the question is just whether that happens when you actually need the information or instantly, when nobody asks for it. You cannot expose a self-updating structure directly because iteration over it might fail due to removed entries. – Darkonaut Nov 12 '18 at 12:16
  • @Kristijan You would either need read/write-locks (not available in the stdlib) or (better) some sort of versioning with snapshots, basically the same functionality databases provide to ensure consistency. In any case, the data structure you expose would have to be static while it is being read. This adds a lot of complexity. I can show you a self-updating structure if you think that answers your question, but you shouldn't expose it directly to read requests for the reasons mentioned before. You could use it to trigger updates on a real database, though. – Darkonaut Nov 12 '18 at 12:17
  • I'm aware of the complexities. That's why I asked whether there's a native solution provided by those who built the multiprocessing library - I wanted to avoid solving all these issues myself :) – Kristijan Nov 12 '18 at 15:27
  • @Kristijan This would probably go far beyond the basics multiprocessing should provide. Let me know if you still need the self-updating structure itself or if you're happy with "No" as an answer ;). – Darkonaut Nov 12 '18 at 15:36
  • I'm happy with the no :) Thanks for the help. – Kristijan Nov 13 '18 at 09:24