
I have a script that performs an independent task on about 1200 different files. It loops through each file and checks whether it has already been completed or is in progress; if it hasn't been done and isn't being actively worked on (which it wouldn't be, since the script isn't currently run in parallel), it performs the task on that file. This follows the general outline below:

myScript.py:

import os

for file in os.listdir(directory):
    fileStatus = getFileStatus(file)
    if fileStatus not in ('Complete', 'inProgress'):
        setFileStatus(file, 'inProgress')
        doTask(file)
        setFileStatus(file, 'Complete')

doTask() takes 20-40 minutes per file on my machine. Its RAM usage ramps up from minimal requirements at the beginning to about 8 GB toward the middle, and back down to minimal requirements at the end; how long each phase takes varies with the file.

I would like to run this script in parallel with itself so that all tasks are completed in the least amount of time possible, using as much of my machine's resources as it can. Assuming (in ignorance) that the limiting resource is RAM (of which my machine has 64 GB), and that all instances will hit peak RAM consumption at the same time, I could mimic the response to this question with something like:

python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 

However, I imagine I could fit more in depending on where each process is in its execution.

Is there a way to dynamically determine how many resources I have available and accordingly create, destroy or pause instances of this script, so that the machine is working at maximum efficiency with respect to time? I would like to avoid making changes to myScript itself and instead call it from another script that handles the creating, destroying and pausing.
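
To make the idea concrete, the kind of wrapper I'm imagining is sketched below. I'm using psutil here only to illustrate the memory check, and the 10 GB headroom and 60-second poll are arbitrary placeholders; I don't know whether this is actually a sensible approach, which is part of what I'm asking.

# wrapper.py (hypothetical) -- a rough sketch, not a working solution
import subprocess
import time

import psutil  # assumed third-party package, used only for the memory query

HEADROOM_BYTES = 10 * 1024**3  # placeholder: require ~10 GB free before launching another instance
POLL_SECONDS = 60              # placeholder polling interval

workers = []
while True:
    # Forget instances that have already exited
    workers = [p for p in workers if p.poll() is None]

    # Launch another instance only if there appears to be enough free memory
    if psutil.virtual_memory().available > HEADROOM_BYTES:
        workers.append(subprocess.Popen(["python", "myScript.py"]))

    time.sleep(POLL_SECONDS)
    # (a real version would also need a stop condition and the destroy/pause logic)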

asheets
  • Have you checked `gnu-parallel`? https://www.gnu.org/software/parallel/parallel_tutorial.html#Number-of-simultaneous-jobs (see also the `Limiting the resources` section). – Jason Hu Feb 22 '18 at 21:46
  • gevent and eventlet may suit your use case. – Reck Feb 22 '18 at 21:50
  • Try some of the [Concurrency & Parallelism modules listed](https://github.com/vinta/awesome-python#concurrency-and-parallelism) on [https://github.com/vinta/awesome-python](https://github.com/vinta/awesome-python). – Reck Feb 22 '18 at 22:03
  • Probably not RAM, with 64 GB to work with. The more likely limiting factors are processors and disk. Good intro to the topic: https://youtu.be/9zinZmE3Ogk. One of the biggest bottlenecks for you would be the spreadsheet itself, since every instance would need to acquire a write lock to update it. You probably want to switch to some kind of queue (whether in memory or an external one). – jpmc26 Feb 23 '18 at 00:39
  • @jpmc26 Thank you for the link! I haven't had time to watch much of it, but so far it looks like an excellent resource, so thanks again! As for the spreadsheet, since each task takes such a considerable amount of time and I'm actually using a Google Sheet, I didn't think the write lock would be an issue, but please let me know if I'm mistaken. – asheets Feb 23 '18 at 02:47

2 Answers


GNU Parallel is built for doing stuff like:

python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 
python myScript.py & 

It also has features for limiting resource usage. Finding the optimal number of jobs is, however, really hard given that:

  • Each job runs for 20-40 minutes (if this were fixed, it would be easier)
  • Each job has a RAM usage envelope shaped like a mountain (if usage stayed at the same level all through the run, it would be easier)

If the 64 GB RAM is the limiting resource, then it is always safe to run 8 jobs:

cat filelist | parallel -j8 python myScript.py

If you have plenty of CPU power and are willing to risk wasting some, then you can start a job whenever there is 8 GB of memory free and the last job was started more than 5 minutes ago (assuming jobs reach their peak memory usage within 3-5 minutes). GNU Parallel will kill the newest job and put it back on the queue if the free memory goes below 4 GB:

cat filelist | parallel -j0 --memlimit 8G --delay 300 python myScript.py
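
Note that both commands pass each line of filelist to myScript.py as an extra argument. If changing the script is acceptable, a per-file variant along the lines of the sketch below (reusing the helper functions from the question's outline) lets GNU Parallel schedule one file per job:

# per-file variant of myScript.py (sketch): process only the file named on the command line
import sys

file = sys.argv[1]

# getFileStatus, setFileStatus and doTask are the helpers from the question's outline
fileStatus = getFileStatus(file)
if fileStatus not in ('Complete', 'inProgress'):
    setFileStatus(file, 'inProgress')
    doTask(file)
    setFileStatus(file, 'Complete')
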
Ole Tange
  • This sounds awesome! I'm curious, how would this handle processes that exit with an error? Would a new process be created to take its place? Additionally, I am brand new to Linux and would appreciate a quick description of what these flags are doing. – asheets Feb 25 '18 at 17:02
  • By default GNU Parallel does not retry a command which exits with an error. You can ask it to by using `--retries 5`. The text describes the options used, so please mention exactly which options you do not find adequately described. Or read about the options in `man parallel`. – Ole Tange Feb 26 '18 at 08:49

Update:
Thanks for clarifying further.
With the requirements and approach you just described, you are going to end up reinventing multithreading. I suggest you avoid the multiple script calls and keep all of the control inside your loop(s) (like the one in my original response).
You are probably looking to query the memory usage of the processes (like this).
One component that might help you here is setting the priority of individual tasks (mentioned here).
You may also find this link useful for scheduling task priorities.
In fact, I recommend the threading2 package here, since it has built-in features for priority control.
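
As a rough illustration of that memory check (a sketch assuming the psutil package; the 8 GB threshold is only an example matching the peak usage you described):

# sketch: block until enough memory appears to be free before dispatching the next heavy task
import time

import psutil  # assumed dependency for querying system memory

def wait_for_free_memory(required_bytes=8 * 1024**3, poll_seconds=30):
    """Sleep until at least required_bytes of RAM is reported as available."""
    while psutil.virtual_memory().available < required_bytes:
        time.sleep(poll_seconds)

# e.g. call wait_for_free_memory() before starting the next doTask(file)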



Original Response:
Since you have roughly identified which parts require how much memory, you may employ multithreading pretty easily.

import threading

# process1..process4 are placeholders for the functions that do the actual work
thread1 = threading.Thread(target=process1, args=(yourArg1,))  # process1 takes 1 GB
thread2 = threading.Thread(target=process2, args=(yourArg1,))  # process2 takes 1 GB

threadList1 = [thread1, thread2]

thread3 = threading.Thread(target=process3, args=(yourArg1,))  # process3 takes 0.5 GB
thread4 = threading.Thread(target=process4, args=(yourArg1,))  # process4 takes 0.5 GB

threadList2 = [thread3, thread4]

# Batch 1: start both threads, then wait for both to finish
for thread in threadList1:
    thread.start()
for thread in threadList1:
    thread.join()

# Batch 2: runs only after batch 1 has completed
for thread in threadList2:
    thread.start()
for thread in threadList2:
    thread.join()
murphy1310
  • Could you explain why you broke it into two lists of two processes? – asheets Feb 22 '18 at 23:10
  • ""so for the parts of the script that may only require 1GB, I would like to create ~64 instances and run through until the portion of the script that requires something like 8GB."" notice that threads 1 and 2 take 1 GB of RAM, while threads 3 and 4 take lesser. I was trying to suggest to classify your threads that you intend to run in a batch. All threads in a batch (/threadList) would run together in parallel. Once you are through with 1 batch, you could then move on to another.... btw, the 'join()' function call helps you wait until the thread is completed. – murphy1310 Feb 22 '18 at 23:19
  • We can't really say whether threading is appropriate without knowing more about the OP's use case. Threading only helps with algorithms bottlenecked by external systems (disk, web services), but we don't know whether the OP's task is processing bound or not. – jpmc26 Feb 23 '18 at 00:46
  • @murphy1310 I understand. My question was ambiguous and poorly articulated. I have edited it (hopefully) for clarity. Each process is identical, but depending on the input file will have a slightly different resource use / time distribution – asheets Feb 23 '18 at 02:26