
I followed the sample code to create a GNU Parallel job queue, as below:

# create a job queue file
touch jobqueue

# start the job queue
tail -f jobqueue | parallel -u php worker.php

# in another shell, add the data
while read LINE; do echo $LINE >> jobqueue; done < input_data_file.txt 

This approach does work and handles the job as a simple job queue. But there are two problems:

1. Reading data from the input file and then writing it to the jobqueue (another file) is slow, as it involves disk I/O.

2. If for some reason my job aborts in the middle and I restart the parallel processing, it will re-run all the jobs in the jobqueue file.

I can add a script in worker.php to actually remove the line from jobqueue when the job is done, but I feel there is a better way to do this.

Is it possible that instead of using

tail -f jobqueue

I can use a named pipe as input to parallel, so that my current setup still works as a simple queue?

I guess that way I won't have to remove the finished lines from the pipe, as they will be removed automatically on read?
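Something like this is what I have in mind (a rough, untested sketch using the same worker.php and input_data_file.txt as above):

# rough sketch (untested): use a named pipe instead of a regular file
mkfifo jobqueue.fifo

# start parallel reading jobs from the fifo (runs in the background)
parallel -u php worker.php < jobqueue.fifo &

# feed jobs into the fifo; writes block once the pipe buffer is full,
# and when the writer closes the fifo parallel sees EOF and exits
cat input_data_file.txt > jobqueue.fifo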

P.S. I know of and have used RabbitMQ, ZeroMQ (and I love it), nng, nanomsg, and even PHP's pcntl_fork as well as pthreads. So it is not a question of what is available for parallel processing; it is more a question of creating a working queue with GNU Parallel.

Waku-2
  • In your attempt now, you are opening the file handle for write once for every input line you _read_. Move the `>> jobqueue` after the `done` (sketched below these comments) – Inian Oct 23 '18 at 09:31
  • Have a read of a method using **Redis** here https://stackoverflow.com/a/22220082/2836621 You can also use BRPOPLPUSH to ensure you don't lose items https://redis.io/commands/brpoplpush – Mark Setchell Oct 23 '18 at 09:32
  • is there one process writing to jobqueue or many? – Nahuel Fouilleul Oct 23 '18 at 10:19
  • Have a look at `mkfifo` and https://unix.stackexchange.com/a/154403/187122. – Socowi Oct 23 '18 at 11:45
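A sketch of the change Inian suggests above (untested): the `>> jobqueue` redirection is opened once for the whole loop instead of once per line:

while read LINE; do echo "$LINE"; done < input_data_file.txt >> jobqueue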

1 Answer

while read LINE; do echo $LINE >> jobqueue; done < input_data_file.txt 

This can be done muuuch faster:

cat >> jobqueue < input_data_file.txt 

While a fifo may work, it will block. That means you cannot put a lot in the queue - which sort of defeats the purpose of a queue.

I would be surprised if disk I/O is an issue for reading the actual jobs: GNU Parallel can run 100-1000 jobs per second. A job can be at most 128 KB, so at the very most your disk has to deliver 128 MB/s. If you are not running 100 jobs per second, then disk I/O of the queue will never be an issue.

You can use --resume --joblog mylog to skip jobs already run if you restart:

# Initialize queue
true >jobqueue
# (Re)start running the queue 
tail -n+0 -f jobqueue | parallel --resume --joblog mylog -u php worker.php
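Putting it together with the faster feeder from above (a sketch, assuming the same worker.php and input_data_file.txt as in the question):

# workers run in the background reading the queue;
# on restart, jobs already recorded in mylog are skipped
tail -n+0 -f jobqueue | parallel --resume --joblog mylog -u php worker.php &
# feed jobs; this can be repeated whenever new input arrives
cat input_data_file.txt >> jobqueue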
Ole Tange