
I have a Python script doing data processing. On my laptop it would probably take 30 days to finish. A single function is executed in a for-loop hundreds of times, and each iteration feeds a new argument into that single function.

I am thinking of designing some parallel/distributed computing scheme to speed up the script: divide the for-loop across multiple Docker containers, with each container responsible for a subset of the for-loop, run with different arguments.

Here is some pseudo code:

def single_fun(myargs):
    # do something here: load the data for this argument and post-process it
    data = file_data(myargs)    # placeholder for the real data-loading step
    post_process(data)          # placeholder for the real processing step

def main_fun():
    # the whole 30-day run: the same function called with 100 different arguments
    for i in range(100):
        single_fun(i)

My idea:

  1. Docker container 1: takes arguments 1 to 5 and runs single_fun().
  2. Docker container 2: takes arguments 6 to 10 and runs single_fun().
  3. Docker container n: takes the n-th chunk of 5 arguments and runs single_fun().
  4. After each Docker container finishes, I can copy the results from that container to the host computer's hard drive (a sketch of this layout follows the list).
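Here is a minimal sketch of how I imagine each container's entrypoint, assuming the chunk boundaries are passed in as environment variables; START_INDEX, END_INDEX, and mymodule are placeholder names I made up, not anything from a real framework:

import os

# single_fun(), file_data() and post_process() are the placeholders from the
# pseudo code above; only the chunking logic here is new.
from mymodule import single_fun   # hypothetical module holding single_fun()

def main_fun():
    # Each container is told its own slice of the 0..99 range via environment
    # variables (the variable names are just placeholders).
    start = int(os.environ["START_INDEX"])
    end = int(os.environ["END_INDEX"])
    for i in range(start, end):
        single_fun(i)

if __name__ == "__main__":
    main_fun()

The host would then launch one container per chunk with something like docker run -e START_INDEX=0 -e END_INDEX=5 -v /host/results:/results my-image (the image name and paths are made up); bind-mounting a host directory means the results land on the host hard drive directly instead of being copied out of each container afterwards.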

My question: is this idea doable? Any useful feedback? How do I implement it, and is there a framework or tool I can leverage to do this? Thanks.

mindcoder
  • Absolutely doable. You might just want to rethink how you split the task distribution: the setup and termination costs of spawning a new Docker-containerized job are non-zero, so the smarter the job scheduling, the more time is spent on actual job processing (perhaps block-orchestrated, so the high cost of spawning a container is spread across many single_fun() calls per block). Anyway, pretty doable – user3666197 Feb 22 '22 at 21:52
  • You might find a job queue system a good match for this problem; RabbitMQ is a popular open-source option. Push all of the tasks into the queue, then run any number of containers that get tasks from the queue and execute them one at a time. In this setup you'd have some number of long-running containers that do the work, and you would not start new containers for each job. – David Maze Feb 22 '22 at 23:59
  • Thanks for suggestions! How do I implement this idea? – mindcoder Mar 02 '22 at 21:06
  • I would suggest Redis. It is very easy and it has Python, C/C++, Ruby, PHP and `bash` clients. So you set it up with `bash` or command-line, add and accept jobs with Python and monitor/inject test data with `bash`. Example here https://stackoverflow.com/a/22220082/2836621 – Mark Setchell Mar 02 '22 at 21:23
  • You need to carefully measure and benchmark whatever it is you are doing. It might be possible to do your calculations in 1/1000th of the time using Numpy or Numba. And if each calculation only takes 200 ms against the overhead of spinning up a Docker container, you should work out how to amortize the cost of starting a new container or process across a larger number of launches. In general, you want to make each process do lots and lots of work before it exits and a new process has to start. – Mark Setchell Mar 02 '22 at 21:51
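Following the queue suggestions in the comments, here is a minimal sketch of how a Redis-based version could look, assuming a Redis server that every container can reach and the redis-py client; the queue key jobs, the ROLE switch, and mymodule are placeholder names, not part of the linked example:

import os
import redis

from mymodule import single_fun   # hypothetical module holding single_fun() from the pseudo code

# One shared Redis instance; its address is passed in via an (assumed) REDIS_HOST variable.
r = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)

def enqueue_jobs():
    # Producer side (run once, e.g. on the host): push every argument onto the shared queue.
    for i in range(100):
        r.rpush("jobs", i)

def worker_loop():
    # Consumer side (run inside each long-running container): pull arguments until the queue drains.
    while True:
        item = r.blpop("jobs", timeout=30)   # blocks for up to 30 s waiting for a job
        if item is None:                     # nothing left to do, let the container exit
            break
        _, raw = item
        single_fun(int(raw))

if __name__ == "__main__":
    if os.environ.get("ROLE") == "producer":
        enqueue_jobs()
    else:
        worker_loop()

With this layout the containers are started once and kept running, so the container start-up cost is paid per worker rather than per single_fun() call.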

0 Answers