5

Let's assume I have a bash script that executes code like so:

for i in $LIST; do
 /path/to/my/script.sh $i &
done

As you can see, I am pushing these scripts into the background and allowing the parent script to launch as many commands as it can, as fast as it can. The problem is that my system eventually runs out of memory, as each instance of the command takes about 15 to 20 seconds to run.

I'm running one static script.sh file, and passing a simple variable (i.e. customer number) into the script. There are about 20,000 - 40,000 records that I am looping through at any given time.

My question is: how can I tell the system to have only X instances of script.sh running at once? If too many are running, I want the parent script to pause until the number of running instances drops below the threshold, and then continue.

Any ideas?

Slickrick12
  • Possible duplicate: http://stackoverflow.com/questions/6511884/, http://stackoverflow.com/questions/4260267/, http://stackoverflow.com/questions/1455695/ – mob Jan 05 '12 at 00:03

7 Answers

5

Two tools can do this

(note that I have changed how the input is selected, because I think you should prepare for handling strange filenames, e.g. names with spaces)

GNU xargs

find -iname '*.txt' -print0 | xargs -0 -r -n1 -P4 /path/to/my/script.sh

Runs up to 4 jobs in parallel.

Xjobs

find -iname '*.txt' -print0 | xjobs -0 /path/to/my/script.sh

Runs as many jobs in parallel as you have processors. xjobs does a better job than xargs of keeping the output of the various jobs separate.

Add -j4 to run 4 jobs in parallel
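
If your input is a list of customer numbers rather than files (as in the question), the same approach works. A minimal sketch, assuming the entries in $LIST contain no whitespace of their own and using -P4 as an example limit:

printf '%s\n' $LIST | xargs -r -n1 -P4 /path/to/my/script.sh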

sehe
3

One simple hack is to create a Makefile that runs the script once per item, and then run it with make -jX:

all : $(LIST)

% : /path/to/my/script.sh
    $^ $*

A nice side-benefit is that make will auto-detect when your script has changed, but for this to be of use, you'd have to replace % with a template for the name of whatever output file your script generates for a given input parameter (assuming that's what it does). E.g.:

out.%.txt: /path…
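
A sketch of what a complete version might look like. It assumes script.sh writes its result to standard output; out.<customer>.txt and the sample customer numbers are invented names for illustration (recipe lines must be indented with a tab):

LIST := 1001 1002 1003

all: $(LIST:%=out.%.txt)

out.%.txt: /path/to/my/script.sh
	$^ $* > $@

Running make -j10 then keeps at most 10 instances of the script going at once, and already-built out.*.txt files are skipped on the next run.
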
Marcelo Cantos
2

You should use xargs with -P. Structure your script like this:


echo "$LIST" | xargs -n1 -P $SIMULTANEOUS_JOBS /path/to/my/script.sh

Where of course SIMULTANEOUS_JOBS is how many commands you want to run at once.
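
For example, with made-up values (a space-separated list and a cap of 10 jobs):

LIST="1001 1002 1003 1004"
SIMULTANEOUS_JOBS=10
echo "$LIST" | xargs -n1 -P "$SIMULTANEOUS_JOBS" /path/to/my/script.sh

xargs splits the echoed list on whitespace and passes one customer number to each invocation of the script.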

frankc
1

You might be interested in the parallel command from Joey Hess' moreutils package.[*] Usage would be

parallel -j MAXJOBS /path/to/my/script.sh -- $LIST

[*] Not to be confused with the more powerful, but harder to use, GNU parallel.
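
For example, to cap the run at 10 concurrent instances (the number and the list are placeholders):

LIST="1001 1002 1003 1004"
parallel -j 10 /path/to/my/script.sh -- $LIST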

Fred Foo
1

A bash-only solution:

MAXJOBS=<your-requested-max + 3>   # the +3 pads for the helper processes spawned by the counting pipeline below
for i in $LIST; do
 /path/to/my/script.sh $i &
 while true; do
   NUMJOBS=`ps --ppid $$ -o pid= | wc | awk -F ' ' '{ print $1;}'`
   test $NUMJOBS -lt $MAXJOBS && break
   sleep 1   # avoid busy-waiting while all slots are full
 done
done
Eugen Rieck
    this is bash, so you can just say `jobs` instead of `ps --ppid ...`. And `wc -l` is more convenient and more efficient than `wc | awk -F ' ' '{ print $1;}'` – mob Jan 05 '12 at 00:04
  • For me, this ended up being the best (easiest) solution (with mob's tweaks). Runs pretty flawless, even with a ton of records. – Slickrick12 Jan 05 '12 at 19:49
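
For reference, a sketch of the loop with mob's tweaks folded in. This is only an approximation of the tweaked version; the cap of 10 is an arbitrary example, and the +3 padding is no longer needed because jobs only counts the background jobs themselves:

#!/bin/bash

MAXJOBS=10
for i in $LIST; do
    /path/to/my/script.sh "$i" &
    # pause while the number of running background jobs is at the cap
    while [ "$(jobs -r | wc -l)" -ge "$MAXJOBS" ]; do
        sleep 1
    done
done
wait    # let the final batch finish
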
0

GNU Parallel is designed for this kind of task:

parallel /path/to/my/script.sh ::: $LIST

This will run one instance of script.sh per CPU core.
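
If you would rather set an explicit cap than use one job per core, -j takes a job count (10 here is only an example):

parallel -j10 /path/to/my/script.sh ::: $LIST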

Watch the intro videos to learn more:

http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Ole Tange
-2

I always like to do a little recursion for this:

#!/bin/bash

max=3
procname="journal"

calltask()
{
    # count processes whose command line matches $procname (minus the grep itself)
    if [ "$(ps -ef | grep ${procname} | grep -v grep | wc -l)" -le "${max}" ]; then
       echo " starting new proc "
       # the actual job would be launched here, in the background
       calltask
    else
       echo "too many processes... going to sleep"
       sleep 5
       calltask
    fi
}

calltask
speeves
  • grep returns a non-zero exit status when nothing matches, and so does pgrep. There is no need for test, or wc. Also, the regex is very loose in that it can match unexpected things. The best bet here is pgrep if you were to take this (not recommended) approach. if pgrep $procname >/dev/null; then ... – jordanm Jan 05 '12 at 00:23
  • I'm not sure I understand the issues with the code. The poster needed to know when a max number of processes was hit, and your example would just check the existence of the process. – speeves Jan 05 '12 at 03:35
  • This solution, while it might work some of the time, is problematic. What if the process name is a substring of another process name? Even if you could fix that up with -w or something, what if the OP wants to run a process with the same name as other processes? Also, this solution has the very bad property that if the grep fails in the negative for some reason, you will completely bomb the system. – frankc Jan 05 '12 at 14:31
  • Youch! You guys are tough. In its defense, a more complex version was used to migrate our entire org, (4000+ users), to Google apps over a one year period. I also use it in https://github.com/speeves/serveradmindowntime which is a command line API browser, (though not in an infinite loop as shown here). Because of its proven track record, I still stand by it. – speeves Jan 05 '12 at 14:41
  • "It worked in environment X" doesn't make it good or portable. Lots of bad code might happen work in environment X, that doesn't make it good practice that should be spread to others. The link you have provided is full of bad practices, and pains me to look at. http://mywiki.wooledge.org/BashPitfalls – jordanm Jan 05 '12 at 16:20
  • Thanks for the feedback, jordanm. – speeves Jan 05 '12 at 16:32