I want to download all files from this section of an HTML page:
<td><a class="xm" name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
<td><a class="xm" name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
<td><a class="xm" name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>
The download link for the first file is https://foo.bar/data/24765/dd, and since it's a zip file, I'd like to unzip it as well.
My script is this:
#!/bin/bash
curl -s "https://foo.bar/path/to/page" > data.html
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' data.html > data.txt
for f in $(cat data.txt); do
    curl -s "https://foo.bar/$f" > data.zip
    unzip data.zip
done
Is there a more elegant way to write this script? In particular, I'd like to avoid saving the intermediate HTML, TXT, and ZIP files to disk.