
1. OS: Linux / Ubuntu x86/x64

2. Task:

Write a Bash shell script to download the URLs in a (large) CSV file as fast/concurrently as possible, naming each output file after a column value.

2.1 Example Input:

A CSV file containing lines like:

001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg

2.2 Example outputs:

A folder, outputs, containing files like:

001.jpg
002.jpg
003.jpg

3. My Try:

I tried mainly two approaches.

1. Using the download tool's built-in support

Take aria2c as an example: it supports an -i option to import a file of URLs to download, and (I think) it processes them in parallel for maximum speed. It does have a --force-sequential option to force downloading in the order of the lines, but I failed to find a way to make the naming part work.

2. Splitting first

Split the file into smaller files, then run a script like the following on each one:

#!/bin/bash
INPUT=$1

# Read each "serial,url" line and download it, naming the file after the serial
while IFS=, read -r serino url
do
    aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"

However, this restarts aria2c for every line, which costs time and lowers the overall speed. One can run the script multiple times from the shell to get 'shell-level' parallelism, but that does not seem like the best way.
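
For reference, a minimal sketch of that shell-level parallelism, assuming the script above is saved as download.sh, the CSV is file.csv, and GNU split is available (the names and the chunk count of 8 are arbitrary assumptions):

split -n l/8 file.csv chunk_     # split the CSV into 8 pieces at line boundaries
for f in chunk_*; do
    ./download.sh "$f" &         # one downloader per chunk, in the background
done
wait                             # block until every background job has finished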

Any suggestions? Thank you,

Laurence W
  • Reference: curl should help you: https://stackoverflow.com/questions/16362402/save-file-to-specific-folder-with-curl-command – thar45 Feb 20 '19 at 07:00
  • See also [Can aria2c download list of urls with specific file names for each](https://stackoverflow.com/q/46102806/6770384). – Socowi Feb 20 '19 at 08:02

2 Answers


aria2c supports so-called option lines in input files. From man aria2c:

-i, --input-file=<FILE>
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.

and later on

These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.

You can convert your csv file into an aria2c input file:

sed -E 's/([^,]*),(.*)/\2\n  out=\1/' file.csv | aria2c -i - 

This will convert your file into the following format and run aria2c on it.

http://farm6.staticflickr.com/5342/a.jpg
  out=001
http://farm8.staticflickr.com/7413/b.jpg
  out=002
http://farm4.staticflickr.com/3742/c.jpg
  out=003

However, this won't create files 001.jpg, 002.jpg, … but 001, 002, …, since that's what was specified. Either specify file names with extensions or derive the extensions from the URLs.

If the extension is always jpg you can use

sed -E 's/([^,]*),(.*)/\2\n  out=\1.jpg/' file.csv | aria2c -i -

To extract the extensions from the URLs, use

sed -E 's/([^,]*),(.*)(\..*)/\2\3\n  out=\1\3/' file.csv | aria2c -i -

Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
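
If some URLs lack an extension, one possible workaround (a sketch not from the original answer; the .jpg fallback is an assumption) is to build the option lines with awk, defaulting to .jpg when no extension can be found:

awk -F, '{
    n = split($2, parts, "/")                 # last path component of the URL
    ext = ".jpg"                              # assumed fallback extension
    if (match(parts[n], /\.[^.]*$/))          # keep the real extension if there is one
        ext = substr(parts[n], RSTART)
    printf "%s\n  out=%s%s\n", $2, $1, ext    # URL line followed by its out= option line
}' file.csv | aria2c -i -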

Socowi
  • Given the actual download speed, this was selected as the answer. The actual number of images is 960k, so when I fed the whole URL file in directly, aria2c crashed. One can use the `split` command to split the file into smaller ones by lines, e.g. `split -n l/5 $FILE`, then process them one by one. – Laurence W Feb 28 '19 at 15:43
  • You've saved my day. – logbasex Mar 04 '21 at 15:12
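
Building on the split suggestion in the comment above, a minimal sketch of chunked processing (the chunk count of 5 and the part_ prefix are arbitrary assumptions):

split -n l/5 file.csv part_     # split the big CSV into 5 chunks at line boundaries
for part in part_*; do
    # convert each chunk to aria2c's input format and download it
    sed -E 's/([^,]*),(.*)/\2\n  out=\1.jpg/' "$part" | aria2c -i -
done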

Using all standard utilities you can do this to download in parallel:

tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -

The -P 0 option in xargs lets it run commands in parallel (per man xargs, as many processes as possible at a time)
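
If you would rather cap the concurrency (see the comments below about network congestion), a hedged variation of the same pipeline; the limit of 8 is an arbitrary choice:

tr '\n' ',' < file.csv |
xargs -P 8 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -    # at most 8 curl processes at a time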

anubhava
  • If the CSV is large you want to limit the number of parallel tasks, though. You will simply congest your network if you run more than a few dozen at the same time. – tripleee Feb 20 '19 at 07:32
  • I thought `-P 0` controls that. It says in `man xargs` that `If max-procs is 0, xargs will run as many processes as possible at a time` – anubhava Feb 20 '19 at 07:33
  • That's how many the CPU will allow, but it will easily start more than your network can handle for a task that is I/O bound. – tripleee Feb 20 '19 at 07:34
  • I have been using `xargs -P 0` to run `curl` command on input files containing 400k-500k records, never ran into any clogging issues. – anubhava Feb 20 '19 at 07:37
  • The relationship is somewhat complex. 500 fetches with a fast network to a small number of sites with good connectivity will be fine; 500 fetches with a fast CPU to a large number of small sites with poor connectivity will congest your network. – tripleee Feb 20 '19 at 07:39