0

I've been trying to run cURL in a huge loop, launching each cURL call as a background process from bash; there are about 904 domains to be cURLed.

The problem is that not all 904 domains get processed, apparently because of the PID limit in the Linux kernel. I have tried raising pid_max to 4194303 (I read about it in this discussion: Maximum PID in Linux), but when I checked, only 901 of the domains had been run as background processes; before I raised pid_max, only around 704 were running as background processes.
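
For reference, this is roughly how I checked and raised the limit (kernel.pid_max is the standard sysctl on Linux; shown here only as a sketch):

# check the current limit, then raise it as root
cat /proc/sys/kernel/pid_max
sudo sysctl -w kernel.pid_max=4194303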

Here is my loop code:

count=0
while IFS= read -r line || [[ -n "$line" ]]; do
  (curl -s -L -w "\\n\\nNo:$count\\nHEADER CODE:%{http_code}\\nWebsite : $line\\nExecuted at :$(date)\\n==================================================\\n\\n" -H "X-Gitlab-Event: Push Hook" -H 'X-Gitlab-Token: '$SECRET_KEY --insecure $line >> output.log) &
  (( count++ ))
done < $FILE_NAME

Does anyone have another solution, or a fix, for handling a huge loop that runs cURL as background processes?

0x00b0
  • 343
  • 1
  • 3
  • 17
  • xargs' `-P` option may help, to limit the number of processes at a time – Nahuel Fouilleul Jul 11 '19 at 08:08
  • Can you explain it? – 0x00b0 Jul 11 '19 at 08:14
  • I added the comment before the code was posted; it seems it won't be easy because of the count variable. Otherwise it could be `xargs -n1 -P50 bash -c '....' - < "$FILE_NAME"`, where 50 is the number of running processes at a time, `....` is the command to execute, `"$1"` is the argument to be used, and the `-` is for `"$0"` – Nahuel Fouilleul Jul 11 '19 at 08:18
  • `--process-slot-var=count` can be used to pass the index, as in [this answer](https://unix.stackexchange.com/a/449225/23266) – Nahuel Fouilleul Jul 11 '19 at 08:25

2 Answers

2

A script `example.sh` can be created:

#!/bin/bash

# $count is set in each worker's environment by xargs --process-slot-var=count
line=$1
curl -s -L -w "\\n\\nNo:$count\\nHEADER CODE:%{http_code}\\nWebsite : $line\\nExecuted at :$(date)\\n==================================================\\n\\n" -H "X-Gitlab-Event: Push Hook" -H 'X-Gitlab-Token: '$SECRET_KEY --insecure $line >> output.log

Then the command could be (limiting the number of running processes at a time to 50):

xargs -n1 -P50 --process-slot-var=count ./example.sh < "$FILE_NAME"
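
(A rough note on the setup this assumes: the helper script has to be executable, and SECRET_KEY has to be exported so that the child processes started by xargs can see it.)

# one-time setup before running the xargs command above
chmod +x example.sh
export SECRET_KEY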
Nahuel Fouilleul
  • 18,726
  • 2
  • 31
  • 36
2

Even if you could run that many processes in parallel, it's pointless - starting that many DNS queries to resolve 900+ domain names in a short span of time will probably overwhelm your DNS server, and having that many concurrent outgoing HTTP requests at the same time will clog your network. A much better approach is to trickle the processes so that you run a limited number (say, 100) at any given time, but start a new one every time one of the previously started ones finishes. This is easy enough with xargs -P.

xargs -I {} -P 100 \
    curl -s -L \
        -w "\\n\\nHEADER CODE:%{http_code}\\nWebsite : {}\\nExecuted at :$(date)\\n==================================================\\n\\n" \
        -H "X-Gitlab-Event: Push Hook" \
        -H "X-Gitlab-Token: $SECRET_KEY" \
        --insecure {} <"$FILE_NAME" >output.log

The $(date) result will be interpolated at the time the shell evaluates the xargs command line, and there is no simple way to get the count with this mechanism. Refactoring this to put the curl command and some scaffolding into a separate script could solve these issues, and should be trivial enough if it's really important to you. (Rough sketch:

xargs -P 100 bash -c 'count=0; for url; do
        curl --options --headers "X-Notice: use double quotes throughout" \
            "$url"
        ((count++))
    done' _ <"$FILE_NAME" >output.log

... though this will restart numbering if xargs receives more URLs than will fit on a single command line.)
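
For what it's worth, here is a rough sketch of one way to keep both a true running number and a per-request timestamp: number the lines before handing them to xargs. It assumes GNU tools, that SECRET_KEY is exported, and that the URLs contain no whitespace:

# prefix each line with its line number, then hand "number url" pairs to the workers
export SECRET_KEY
nl -ba -w1 -s' ' "$FILE_NAME" |
    xargs -n2 -P 100 bash -c '
        count=$1 url=$2
        curl -s -L \
            -w "\n\nNo:$count\nHEADER CODE:%{http_code}\nWebsite : $url\nExecuted at :$(date)\n==================================================\n\n" \
            -H "X-Gitlab-Event: Push Hook" \
            -H "X-Gitlab-Token: $SECRET_KEY" \
            --insecure "$url"
    ' _ >>output.log

Because the per-URL script runs in its own bash -c invocation, $(date) is evaluated when each request is made rather than once when the outer command line is parsed.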

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • TIL `xargs --process-slot-var` exists but it's a relatively new feature and GNU only. If you have it, by all means use it. – tripleee Jul 11 '19 at 09:13