
I have a Python program that works like this:

  1. Initialize data

  2. Call outside software (using subprocess) to calculate the outcome of the data, then read the outside software's output back in

  3. Manipulate the output and prepare it to go back into step 1 (a rough sketch of this loop is below)
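
For concreteness, a minimal sketch of the current serial loop; the `./solver` executable name, its arguments, and the post-processing are placeholders rather than my real code:

```python
import subprocess

data = "0"  # step 1: initialize data (placeholder)

for iteration in range(3):  # the real stopping condition lives here
    # Step 2: call the outside software (hypothetical "./solver") via subprocess
    # and read its output back in
    proc = subprocess.run(["./solver", data], capture_output=True, text=True)
    output = proc.stdout

    # Step 3: manipulate the output and prepare it for the next pass through step 1
    data = output.strip()  # stand-in for real post-processing
```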

I want to parallelize step 2 in a cluster environment (Slurm), spreading the work across many nodes.

I was trying to find the easiest approach to this, as I don't think subprocess will automatically use multiple nodes even if they are allocated to the Python program in a batch script.

I tried using dask-jobqueue; however, it relies on creating a batch script for every worker, meaning I would have to submit tens of batch jobs and then wait for them all to start before the code could use them.

I was wondering if anyone has any advice, as this seems like it should be an easy thing to do.

Edit: I think this is more complex than just using multiprocessing. This question gets at what I am trying to accomplish; I am wondering what the ideal package would be for this type of problem.

  • There are many ways to do this. At the most basic level, there are two choices: 1) have your Python process make multiple simultaneous subprocess calls, each of which attacks a part of the problem as a single task, or 2) have Python make a single subprocess call to something that then breaks up the task and in turn utilizes some sort of parallelization to perform the task in parts in parallel. – CryptoFool Apr 09 '19 at 21:26
  • #1 could be accomplished in a number of ways: a) multiple non-blocking subprocess calls in a single thread of execution, b) multiple threads each making a subprocess call, or c) multiple multiprocessing workers (separate processes) making subprocess calls. #2 covers almost unlimited possibilities. Maybe your cluster environment provides a natural way to do #2. If not, I'm old school, so I'd probably go with b) over c). – CryptoFool Apr 09 '19 at 21:27
  • It appears that your question has already been asked. Check out this post. - This post seems to address all three of the possibilities I mentioned... – CryptoFool Apr 09 '19 at 21:35
  • Possible duplicate of [Python threading multiple bash subprocesses?](https://stackoverflow.com/questions/14533458/python-threading-multiple-bash-subprocesses) – CryptoFool Apr 09 '19 at 21:35
  • 1
    Thank you for the responses. Does subprocess take advantage of multiple nodes available to it on a cluster? – sealpancake Apr 09 '19 at 21:51
  • No, not directly. subprocess itself just launches a shell (usually Bash) and executes a single command in that shell. - (well, technically, by default it doesn't even use a shell...it executes a command directly. See the "shell" parameter to the various subprocess calls in the docs for that module). – CryptoFool Apr 09 '19 at 22:01
  • Sorry, perhaps I was not clear. I want to parallelize my subprocess calls across nodes, most likely through some sort of distributed programming approach (MPI, dask, etc.) – sealpancake Apr 09 '19 at 22:04
  • Sorry. I was vague in my prior comment that I deleted. If you used subprocesses, you'd be executing a local command, but then turning around and running something that dispatched to a remote machine, like 'ssh'. There are 'ssh' libraries for doing this. I've used Fabric to do this. It takes a host name or IP and allows you to "talk" to a remote machine. That might work for your purpose. – CryptoFool Apr 09 '19 at 22:09
  • I haven't done this specific thing using Python. I just Googled for "python distributed computing" and came up with quite a few things, like http://dispy.sourceforge.net/. Perform that Google search; there's lots of good stuff on the first page. - This is a very broad subject. There aren't even just "a few", much less one right way to do this. What would work "best" for you would be a function of your exact use case. – CryptoFool Apr 09 '19 at 22:13
  • Took a quick look at Dask. That seems at first glance to be just for multiprocessing on a single host. - also check out [Fabric](http://www.fabfile.org/) - sorry I don't have a "sure, just do this!" answer. – CryptoFool Apr 09 '19 at 22:18
  • Thanks, this helps. I will update with what I did once I figure it all out – sealpancake Apr 10 '19 at 23:19
  • Cool. I'd love to hear what solution you pick and how it works out for you. – CryptoFool Apr 10 '19 at 23:20

1 Answer


It seems the best way to approach this problem depends heavily on the cluster size, environment, etc. that you are working with. What worked best for me was MPI4py: the worker ranks run the isolated subprocess calls across my X nodes (step 2), while the head rank runs the rest of the code (steps 1 & 3). This lets my Slurm reservation stay constant, rather than having to request nodes at every loop iteration or request nodes during the program run (as dask-jobqueue does).
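
For illustration, here is a rough sketch of that pattern with mpi4py. The `./solver` executable, its input files, and the iteration count are placeholders, and the script is launched under the existing Slurm reservation with something like `srun -n <ntasks> python driver.py`:

```python
from mpi4py import MPI
import subprocess

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N_ITERATIONS = 10  # placeholder outer-loop count

for step in range(N_ITERATIONS):
    if rank == 0:
        # Step 1: the head rank initializes/prepares the data
        tasks = [f"input_{step}_{i}.dat" for i in range(size - 1)]  # placeholder inputs
    else:
        tasks = None

    # Hand every rank the full task list
    tasks = comm.bcast(tasks, root=0)

    result = None
    if rank != 0:
        # Step 2: each worker rank runs the external code via subprocess on its own node
        infile = tasks[rank - 1]
        proc = subprocess.run(["./solver", infile], capture_output=True, text=True)
        result = proc.stdout

    # Collect the worker outputs back on the head rank
    results = comm.gather(result, root=0)

    if rank == 0:
        # Step 3: manipulate the outputs and prepare the next iteration's data
        outputs = [r for r in results if r is not None]
        # ... post-processing goes here ...
```

Whether the worker ranks actually land on separate nodes depends on how the job is requested (e.g. `--ntasks` and `--ntasks-per-node` in the sbatch script), so the Slurm options are part of the design here too.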