
I have this small script to download images from a given list in a file.

FILE=./img-url.txt
while read -r line; do
    url=$line
    wget -N -P /images/ "$url"
    wget -N -P /images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"

The problem is that it takes too long to run (the file has more than 5000 lines). Is there any way to speed things up? For example, could I split the source txt into separate files and run multiple wget instances at the same time?

Adrian
    Relevant: [Parallel wget in Bash](https://stackoverflow.com/questions/7577615/parallel-wget-in-bash) – jDo Apr 02 '16 at 23:17

1 Answer


There are a number of ways to go about this. GNU Parallel would be the most general solution (a sketch appears further down), but given how you posed your question: yes, split the file into parts and run the script on each part simultaneously. How many pieces to split the file into is an interesting question. 100 pieces would mean spawning 100 wget processes at once; almost all of them would sit idle while a very few used all the network bandwidth. One process might use all the bandwidth for an hour for all I know, but I'm going to guess that splitting the file into four pieces, so four wget processes run simultaneously, is a good compromise. I'm going to call your script geturls.sh. Type this at the command line:

split -n l/4 img-url.txt
for f in xaa xab xac xad; do
    ./geturls.sh "$f" &
done

This splits your file into four roughly even pieces (-n l/4 tells GNU split to make four chunks without breaking any lines). The split output files are by default given some bland names, in this case xaa, xab, xac and xad. The for loop takes the name of each piece and gives it to geturls.sh as a command line argument, the first thing on the command line after the program name. Each invocation of geturls.sh is put into the background (&) so the next iteration of the loop can start immediately. In this way geturls.sh is run on all four pieces of the file virtually simultaneously, so you've got four wget processes going at the same time. If you want the shell to wait until all four have finished before giving you a prompt back, add a wait command after the loop.
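
For comparison, here is a minimal sketch of the GNU Parallel route (assuming GNU Parallel is installed; {} stands for the current input line and {.} for the line with its extension stripped, playing the role of ${url%.jpg}):

# run up to four jobs at a time, reading URLs from img-url.txt
parallel -j 4 -a img-url.txt '
    wget -N -P /images/ {}
    for i in 001 002 003 004 005; do wget -N -P /images/ {.}_"$i".jpg; done'

No splitting or cleanup of temporary piece files is needed, and -j controls exactly how many downloads run at once.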

The contents of geturls.sh are:

#!/bin/bash
FILE=$1
while read -r line; do
    url=$line
    wget -N -P /images/ "$url"
    wget -N -P /images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"

The only changes I made to your code are the explicit declaration of the shell (out of habit, mostly) and that FILE is now assigned the value of $1. Recall that $1 is the first command line argument, which here is the name of one of the pieces of your img-url.txt file.
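
For example, to run the script by hand on just one of the pieces (xab is one of the split output names from above):

./geturls.sh xab

Inside the script, $1 then expands to xab, so FILE=xab and the loop reads only that piece's URLs.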

Erik Bryer
  • Perfect, but the -l switch should be the -n switch in your code. I can't edit a single character in your code. – Adrian Apr 03 '16 at 10:35
  • One more question: the script does not exit at the end. Where should I put the "exit 0" command? I tried putting it before done (in both scripts), but it does not help. – Adrian Apr 03 '16 at 11:35
  • Seems like -n and -l would do the same thing, but I'll take your word for it. :) The "exit 0" command is superfluous. If a script gets to the bottom, the default assumption is that it completed correctly. But if there is some non-zero exit status, you'd want to know about it. So setting exit 0... I can't think of a good reason. I mean, if there were some problem, then setting exit 0 would cover it up. You always want the exit status to convey some useful information if possible. That isn't always easy, but setting it to a single value every time removes the possibility entirely. – Erik Bryer Apr 04 '16 at 01:35