
I've been trying to solve this for a few days with no luck. What I am trying to do is feed my curl JSON request with my local IPs, run the multiple cURL calls as fast as possible, and write the returned values to a file.

My first script runs fine, but it processes the IPs one by one and takes forever. I would like to run something like xargs or parallel.

I have the following .txt file (IP.txt):

192.168.1.100
192.168.1.102
192.168.1.104
192.168.1.105
192.168.1.106
192.168.1.168
...

I am feeding this file to the following script:

cat IP.txt | while read LINE; do

    C_RESPONSE=$(curl -s -X POST -H "Content-Type: application/json" --data '{"method":"data","params":[]}' $LINE:80 | jq -r '.result[]')

    for F_RESPONSE in $C_RESPONSE; do
        echo $LINE $F_RESPONSE >> output.txt
    done
done

The output of this script looks like this:

192.168.1.100 value_1
192.168.1.100 value_2
192.168.1.100 value_3
192.168.1.100 value_4
192.168.1.100 value_5
192.168.1.102 value_1
192.168.1.102 value_2
192.168.1.102 value_3
192.168.1.104 value_1
192.168.1.104 value_2
192.168.1.104 value_3
192.168.1.104 value_4
192.168.1.104 value_5
192.168.1.104 value_6
192.168.1.104 value_7
192.168.1.104 value_8
192.168.1.104 value_9
192.168.1.104 value_10
192.168.1.105 value_1
192.168.1.105 value_2
192.168.1.106 value_1
192.168.1.168 value_1
...

I would like to make this code faster with parallel or xargs or even &. However, when I add & like this:

C_RESPONSE=$(curl -s -X POST -H "Content-Type: application/json" --data '{"method":"data","params":[]}' $LINE:80 | jq -r '.result[]') &

the command substitution is sent to the background, so $C_RESPONSE is never set in the current shell and I am unable to process

for F_RESPONSE in $C_RESPONSE; do
echo $LINE $F_RESPONSE >> output.txt

With a parallel command like this I get only the values, but I can't see which IP they belong to (a possible fix is sketched after the output below):

cat IP.txt | parallel -j200 "curl -H 'Content-Type: application/json' {}:80 -X POST -d '{\"method\":\"data\",\"params\":[]}'" | jq -r '.result[]' >> output.txt

value_1
value_2
value_3
value_4
value_5
value_1
value_2
value_3
value_1
value_2
value_3
value_4
value_5
value_6
value_7
value_8
value_9
value_10
value_1
value_2
value_1
value_1
...
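For the record, one way to keep the IP next to each value is GNU parallel's --tag option, which prefixes every output line with its input argument (a sketch; note that jq moves inside the parallel command, and the IP and value end up separated by a tab rather than a space):

cat IP.txt | parallel -j200 --tag "curl -s -X POST -H 'Content-Type: application/json' --data '{\"method\":\"data\",\"params\":[]}' {}:80 | jq -r '.result[]'" >> output.txt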

I've tried googling and reading many tutorials, but no luck. How can I solve this problem?

Thanks!

So here is a quick solution as proposed by @Poshi. The solution has no limiter, so it can cause problems if too many background jobs end up running at once (a simple way to cap them is sketched after the code).

#!/bin/bash

function call() {
    arg1=$1
    C_RESPONSE=$(curl -s -X POST -H "Content-Type: application/json" --data '{"method":"data","params":[]}' $arg1:80 | jq -r '.result[]')

    for F_RESPONSE in $C_RESPONSE; do
        echo $arg1 $F_RESPONSE >> output.txt
    done
}

cat IP.txt | while read LINE; do

    call $LINE &

done
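A simple way to cap the number of background jobs, as a sketch (assuming bash 4.3+, where wait -n is available; call is the function defined above):

max_jobs=10
while read -r LINE; do
    # if the cap is reached, block until any one background job exits
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
    call "$LINE" &
done < IP.txt
wait    # let the remaining jobs finish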
  • Maybe you can encapsulate the cURL+postprocessing in a function, and send that function into background. That way you will be able to process the results and you won't have to resort to more complicated tools than the standard bash job management. – Poshi Jan 24 '19 at 11:26
  • Thank you very much for the great advice! It worked as desired. #!/bin/bash function call() { arg1=$1 C_RESPONSE=$(curl -s -X POST -H "Content-Type: application/json" --data '{"method":"data","params":[]}' $arg1:80 | jq -r '.result[]') for F_RESPONSE in $C_RESPONSE; do echo $arg1 $F_RESPONSE >> output.txt done } cat IP.txt | while read LINE; do call $LINE & done – bashnewbie Jan 24 '19 at 12:55
  • The quick solution is not efficient; it may cause severe performance issues if you do not limit the number of running background processes for your curl script. – Derviş Kayımbaşıoğlu Jan 24 '19 at 13:11
  • @Simonare you are right, already killed my machine with that. Need to add some kind of limiter. Any ideas? – bashnewbie Jan 24 '19 at 13:34
  • I added an answer for you, please check – Derviş Kayımbaşıoğlu Jan 24 '19 at 13:43
  • [DontReadLinesWithFor](https://mywiki.wooledge.org/DontReadLinesWithFor) and [BashFAQ #1](http://mywiki.wooledge.org/BashFAQ/001) are pertinent. – Charles Duffy Jan 24 '19 at 15:01
  • BTW, all-caps variables are in reserved namespace -- you should use lower-case names for variables you define. See http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html, fourth paragraph, keeping in mind that environment and shell variables share a namespace (setting the latter will overwrite the former when names clash). – Charles Duffy Jan 24 '19 at 15:07
  • As another aside, putting `>>output.txt` on your individual `echo` lines is crazy inefficient, because it reopens the output file every time just to write one line, and closes and flushes it after each line. Much better to open the handle just once and reuse it, as by putting `>output.txt` after the `done` to redirect stdout for the whole loop (sketched after these comments). – Charles Duffy Jan 24 '19 at 15:09
  • I am aware >>output.txt is not the most elegant solution; however, I am constantly reading that file from another script and I would like it to be as up to date as possible. I don't know a solution other than >>. – bashnewbie Jan 24 '19 at 15:44
  • `echo` still `write()`s, making content visible to the VFS layer and thus to other processes; it just doesn't force more of a flush than that (so it doesn't force the inode to be updated with a new mtime, f/e). Which is to say, I *gave* you a solution, two comments ago. :) – Charles Duffy Jan 24 '19 at 16:02
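To illustrate the redirection suggestion, a sketch of the original loop with the output file opened once for the whole loop instead of once per echo:

while read -r LINE; do
    C_RESPONSE=$(curl -s -X POST -H "Content-Type: application/json" --data '{"method":"data","params":[]}' "$LINE":80 | jq -r '.result[]')
    for F_RESPONSE in $C_RESPONSE; do
        echo "$LINE" "$F_RESPONSE"
    done
done < IP.txt > output.txt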

2 Answers


As an improvement to your code, you may consider adding control over the number of processes started. Check the script below:

#!/bin/bash

function call() {
    arg1=$1
    C_RESPONSE=$(curl -s -X POST -H "Content-Type: application/json" --data '{"method":"data","params":[]}' $arg1:80 | jq -r '.result[]')

    for F_RESPONSE in $C_RESPONSE; do
        echo $arg1 $F_RESPONSE >> output.txt
    done
}

cat IP.txt | while read LINE; do

    # wait while more than 10 curl processes started by this script are still running
    while (( $(pgrep -P "$$" curl | wc -l) > 10 )); do
        sleep 0.2
    done

    call $LINE &

done
Derviş Kayımbaşıoğlu

xargs -P is a tool built for the job. (GNU parallel even moreso, but it's a mess of perl with semantics that make its use fault-prone so I cannot recommend its use; see the mailing list thread at https://lists.gnu.org/archive/html/bug-parallel/2015-05/msg00005.html).

call() {
  : # put your definition here
}
export -f call # make that function accessible to child processes

# tell xargs to start 4 shells (adjust to taste!) processing lines.
# presently, this gives each shell 16 jobs to reduce startup overhead
# ...adjust to tune for your actual workload.
<IP.txt xargs -n 16 -d $'\n' -P4 bash -c 'for line; do call "$line"; done' _
Charles Duffy
  • I'd like to ask if you know which solution is faster, or if there is any way to run some kind of benchmark on both solutions: should I stick with the bash loop and function, or use xargs? I'd like to make it as snappy as possible on a regular machine like an i7 with 16 GB on a local network where internet speed is really not a limitation, without using Python, Java, or C. I can do "something" in bash only at this moment. – bashnewbie Jan 24 '19 at 16:49
  • On a system where internet speed isn't a limitation, you'll want a much larger value than `-P4`. Maybe start with `-P16`, measure throughput, and tune from there? This is certainly going to be lower-latency than the shell-loop hack -- you don't need any equivalent to the `sleep 0.2` it uses when you're waiting for `SIGCHLD` events to provide immediate notice when a subprocess exits (whereas if you took out that `sleep` call, the other answer would be spinning its wheels running `pgrep` over and over and over, which waiting for an actual affirmative notice from the OS avoids). – Charles Duffy Jan 24 '19 at 20:43
  • ...also, if your list of IPs is fairly short, you'll be better off with a lower `-n` value; if it's long, it doesn't make much of a difference. – Charles Duffy Jan 24 '19 at 20:44
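Putting the pieces together, a sketch of the complete xargs version with the call body taken from the question's function (adjust -P and -n to taste):

#!/bin/bash

call() {
    arg1=$1
    C_RESPONSE=$(curl -s -X POST -H "Content-Type: application/json" --data '{"method":"data","params":[]}' "$arg1":80 | jq -r '.result[]')
    for F_RESPONSE in $C_RESPONSE; do
        echo "$arg1" "$F_RESPONSE"
    done
}
export -f call    # make the function visible to the bash processes xargs starts

# 16 lines per shell, up to 16 shells at a time; stdout collected into output.txt
<IP.txt xargs -d $'\n' -n 16 -P 16 bash -c 'for line; do call "$line"; done' _ > output.txt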