
I had a small script that sourced each OpenStack tenant's credentials and fetched some output with the help of Python. The reports took too long to generate, and it was suggested that I use xargs. My earlier code was like below.

#!/bin/bash
cd /scripts/cloud01/floating_list

rm -rf ./reports/openstack_reports/
mkdir -p ./reports/openstack_reports/

source ../creds/base
for tenant in A B C D E F G H I J K L M N O P Q R S T
do
  source ../creds/$tenant
  python ../tools/openstack_resource_list.py > ./reports/openstack_reports/$tenant.html

done
lftp -f ./lftp_script

Now I have put xargs in the script, and it looks something like this.

#!/bin/bash
cd /scripts/cloud01/floating_list

rm -rf ./reports/openstack_reports/
mkdir -p ./reports/openstack_reports/

source ../creds/base

# Need xargs idea below
cat tenants_list.txt | xargs -P 8 -I '{}' # something that takes the tenant name and source
TENANT_NAME={}
python ../tools/openstack_resource_list.py > ./reports/openstack_reports/$tenant.html
lftp -f ./lftp_script

In this script, how am I supposed to implement source ../creds/$tenant? Each tenant's credentials need to be sourced while that tenant is being processed, and I am not sure how to include that with xargs for parallel execution.

Heenashree Khandelwal

2 Answers


xargs can't easily run a shell function ... but it can run a shell.

# If the tenant names are this simple, don't put them in a file
printf '%s\n' {A..T} |
xargs -P 8 -I {} bash -c 'source ../creds/"$0"
      python ../tools/openstack_resource_list.py > ./reports/openstack_reports/"$0".html' {}

Somewhat obscurely, the argument after bash -c '...' gets exposed as $0 inside the script.
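A minimal demonstration of this $0 behavior, using a made-up tenant name:

```shell
# The first argument after the inline script fills $0 inside it
bash -c 'echo "tenant is $0"' Alpha
# prints: tenant is Alpha

# With -I {}, xargs substitutes each input line into that slot
printf '%s\n' X Y | xargs -I {} bash -c 'echo "got $0"' {}
# prints: got X
#         got Y
```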

If you want to keep the tenants in a file, xargs -a filename is a good way to avoid the useless use of cat, though it's not portable to all xargs implementations. (Redirecting with xargs ... <filename is obviously completely portable.)
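To illustrate both forms with a throwaway file (remember that -a is a GNU extension, while the redirection works with any xargs):

```shell
printf '%s\n' A B C > tenants_list.txt

# GNU xargs can read the items from a file directly
xargs -a tenants_list.txt -n 1 echo

# Portable equivalent: redirect the file to stdin
xargs -n 1 echo < tenants_list.txt
```

Both commands produce identical output; neither forks a separate cat process.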

For efficiency, you could refactor the script to loop over as many arguments as possible:

printf '%s\n' {A..T} |
xargs -n 3 -P 8 bash -c 'for tenant; do
      source ../creds/"$tenant"
      python ../tools/openstack_resource_list.py > ./reports/openstack_reports/"$tenant".html
  done' _

This will run a maximum of 8 parallel shell instances with a maximum of 3 tenants assigned to each (so for these 20 tenants, in fact only 7 instances), though with this small number of arguments, the difference in performance is probably negligible.

Because we are now actually receiving a list of arguments, we pass _ as the value to populate $0 with (just because it needs to be set to something, in order to get the real arguments in place properly).
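To see how the batching and the _ placeholder interact, here is a stripped-down sketch where echo stands in for the real per-tenant work:

```shell
# Without an `in` list, `for tenant` iterates over "$@" -- the batch
# of arguments xargs hands each bash invocation; `_` only fills $0.
printf '%s\n' A B C D E |
xargs -n 2 bash -c 'for tenant; do echo "tenant: $tenant"; done; echo "--"' _
# prints:
# tenant: A
# tenant: B
# --
# tenant: C
# tenant: D
# --
# tenant: E
# --
```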

If the source might make modifications which are not always guaranteed to be overwritten by the source in the next iteration (say, some tenants set variables which need to be unset for other tenants), that complicates matters. Post a separate question if you actually need help resolving that, or fall back to the first variant, where each tenant runs in a separate shell instance.

tripleee
  • thanks so much. I am getting the result as expected. Accepting your answer. This does land me into some other issues which is from my python file but that is another story now. :) – Heenashree Khandelwal Dec 22 '17 at 08:13
  • adding xargs increased CPU load on the server from which I am requesting the data. I added sleep of 10 seconds to create a small lag. Do you have any suggestions on this? – Heenashree Khandelwal Jan 24 '18 at 11:42
  • Parallel processing by definition increases load. If the remote server is not prepared to handle that sort of load, reduce parallelism, perhaps only by a bit (4 instead of 8 maybe?) but eventually back to serialized if that's more convenient for the system as a whole. – tripleee Jan 24 '18 at 11:59
  • If your Python script accesses a remote server and the Python API allows you to set a priority or "nice" level, that would allow you to communicate to the server that this is something which is allowed to take a while longer if the load is high; but this is going pretty far into speculative country. – tripleee Jan 24 '18 at 12:04
  • I do not have that authority to set priority on nice level. And I am also required to generate reports faster but keep load to a minimum. It's like a deadlock. :( – Heenashree Khandelwal Jan 30 '18 at 05:10
  • Without knowing what your Python script does it's hard to suggest improvements. If you can amortize the cost over a longer time the load increase at any given time will be negligible. But if the report needs to show the status of a volatile resource in near real time, of course you can't have a result which might be several minutes old. – tripleee Jan 30 '18 at 05:34

With GNU Parallel it looks like this:

#!/bin/bash
cd /scripts/cloud01/floating_list

rm -rf ./reports/openstack_reports/
mkdir -p ./reports/openstack_reports/

source ../creds/base
doit() {
  source ../creds/"$1"
  python ../tools/openstack_resource_list.py > ./reports/openstack_reports/"$1".html
}
env_parallel doit ::: {A..T}
lftp -f ./lftp_script

env_parallel copies the environment into each command, including functions. It then runs parallel, which by default runs one job per CPU core in parallel.

Depending on the task it may be faster or slower to run more or fewer in parallel. Adjust with -j8 for 8 jobs in parallel or -j200% for 2 jobs per core.

Ole Tange