I made a loop in my bash script. The important thing is that it must wait for some processes (here, scrapy spiders) to finish before incrementing the variables that are used in the loop condition.
The general algorithm is the following (no programming language used here):
# initialisation
count=0
urlsFileNbLines=$( wc -l < urlsFileToScrape )

while count <= 5 or urlsFileNbLines != 0
    launch scrapy spiders
    wait until scrapy spiders are done
    add 1 to $count (each loop iteration)
    update $urlsFileNbLines
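In bash terms, the skeleton I am aiming for would look roughly like this (keeping the or from the condition above; the wait step is exactly the part I am missing):

count=0
urlsFileNbLines=$( wc -l < urlsFileToScrape )

while [ "$count" -le 5 ] || [ "$urlsFileNbLines" -ne 0 ]; do
    # launch the scrapy spiders here (in the background)
    # ... wait until all scrapy spiders are done ...   <- the missing step
    count=$(($count+1))
    urlsFileNbLines=$( wc -l < urlsFileToScrape )
done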
So, the problem is: if I don't use the condition that waits for the processes to be done (the wait until scrapy spiders are done step) and I increment the variables at the same time, the loop launches scrapy spiders again, whereas I must wait for the previous ones to finish before updating $urlsFileNbLines.
Now, let's tackle the bash part.
To write this condition, until scrapy spiders are done, I was inspired by this. What I understand from the "Bash shell script to check running process" part is that if pgrep -x scrapy returns something, then the test is implicitly true. So I tried to build a condition that waits until it is false.
That's why I tried until [ ! pgrep -x scrapy ]; do ... and even until [ ! $(pgrep -x scrapy) ]; do, but it always gives errors.
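For reference, a form that relies on pgrep's exit status instead of testing its output would be a minimal sketch like this (the polling interval is an arbitrary choice):

# keep looping as long as a process named exactly "scrapy" is still running;
# pgrep exits with 0 when it finds a match and non-zero otherwise
while pgrep -x scrapy > /dev/null; do
    sleep 5    # poll every 5 seconds (arbitrary interval)
done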
I tried this too:
At the launch scrapy spiders step, there is:
for i in `seq 1 5`; do
    scrapy crawl spider -a param=$i & PID$i=$! &
done
echo "here it must wait the end of process. Count value is ${count}"
At the until scrapy spiders are done step (just below the previous), there is:
for i in `seq 1 5`; do
    wait $PID$i
done
count=$(($count+1))
...
But it does not wait: the incrementation of $count happens very quickly and goes over 5, because the loop for i in `seq $1 $maxSeq` is not finished before the script continues to increment. Meanwhile, the & PID$i=$! & part returns the error script.sh: line 93: PID1=4758 : command not found. That's messy.
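As far as I can tell, that error happens because PID$i=$! is not recognised by bash as an assignment (the name contains $i), so after expansion the whole word PID1=4758 is executed as a command. A dynamically named variable or an indexed array would be needed instead, for example:

declare "PID$i=$!"   # assign to a dynamically named variable
# or, more simply, an indexed array (the approach used in the update below):
pids[$i]=$!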
What can I do?
UPDATE
Thanks to @Barmar I made this solution. And to wait for all the processes, I took inspiration from this topic.
pid=()   # an empty array for the pids that are coming
for i in `seq 1 5`; do
    scrapy crawl spider -a param=$i ; pid[$i]=$! &
done
echo "command pgrep before wait"
echo $(pgrep -x scrapy)
for pid in ${pid[*]}; do
    wait $pid & echo 'end of process ${pid}' &
done
echo "command pgrep after wait"
echo $(pgrep -x scrapy)
It does not wait for all the processes to be done before incrementing the variables, so it launches the same spider instances again and that creates connection conflicts.
Writing wait "${pids[@]}" instead of the loop works perfectly in my case.
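For completeness, a sketch of the launch-and-wait block as it works for me now (assuming the spiders are backgrounded with & and their PIDs are recorded immediately in a pids array):

pids=()                                      # PIDs of the background spiders
for i in `seq 1 5`; do
    scrapy crawl spider -a param=$i &        # launch the spider in the background
    pids[$i]=$!                              # record its PID right away
done

wait "${pids[@]}"                            # block until every spider has exited

count=$(($count+1))                          # only now update the loop variables
urlsFileNbLines=$( wc -l < urlsFileToScrape )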
bash version: 4.4.20 | OS version: Ubuntu 18.04.3 LTS.