
I want to download all the files from this section of an HTML page:

    <td><a class="xm" name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
    <td><a class="xm" name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
    <td><a class="xm" name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>

The download link for the first file is https://foo.bar/data/24765/dd, and as it's a zip file, I'd like to unzip it as well.

My script is this:

    #!/bin/bash
    curl -s "https://foo.bar/path/to/page" > data.html

    gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' data.html > data.txt

    for f in $(cat data.txt); do
        curl -s "https://foo.bar/$f" > data.zip
        unzip data.zip
    done

Is there a more elegant way to write this script? I'd like to avoid saving the html, txt and zip files.

macxpat
    Please [Don't Parse XML/HTML With Regex](https://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Jan 22 '22 at 17:52
  • Yeah, I was expecting that one. The reader should also know about the [answer](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1733489#1733489). – macxpat Jan 23 '22 at 00:31

2 Answers


The bsdtar command can unzip archives from stdin, allowing you to do this:

    curl -s "https://foo.bar/$f" | bsdtar -xf-

And of course you can pipe the first curl command directly into awk:

    curl -s "https://foo.bar/path/to/page" |
    gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' > data.txt

And in fact you might as well just pipe the output of that pipeline directly into a loop:

    curl -s "https://foo.bar/path/to/page" |
    gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' |
    while read -r archive; do
        curl -s "https://foo.bar/$archive" | bsdtar -xf-
    done
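For reference, here is a hedged, offline sketch of that same pipeline-into-loop shape: a shell variable stands in for the live page, `grep`/`sed` stand in for the gawk match (in case gawk isn't installed), and an `echo` stands in for the download step, so none of the URLs are contacted. Note the `-r` on `read`, which stops backslashes in the extracted paths from being interpreted:

```shell
#!/bin/bash
# Offline sketch of the pipeline-into-loop pattern.
# The HTML sample and the echo are stand-ins; in the real script the
# echo line would be:  curl -s "https://foo.bar/$archive" | bsdtar -xf-
page='<td><a class="xm" name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
<td><a class="xm" name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>'

printf '%s\n' "$page" |
grep -Eo 'href="/data/[0-9]{5}/dd"' |   # pull out the matching href attributes
sed -E 's|href="/(.*)"|\1|' |           # strip the wrapper, keep the path
while read -r archive; do               # -r: don't mangle backslashes
    echo "would fetch https://foo.bar/$archive"
done
```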
larsks

> I'd like to avoid saving (...) zip files.

Generally, numerous Linux terminal commands accept `-` to mean "use stdin" where a filename is required. After a cursory search, it appears that certain versions of unzip do not support this (see How to redirect output of wget as input to unzip? at unix.stackexchange.com), whilst others, like the one described by freebsd.org, do:

> If specified filename is "-", then data is read from stdin.
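As a quick, hedged illustration of the `-` convention with a tool that is documented to support it, gzip treats `-` as "read from stdin", so a string can be round-tripped through compression entirely in a pipeline, with no temporary files:

```shell
# gzip accepts "-" as a stand-in filename for stdin, so the decompression
# step below reads the compressed bytes straight from the pipe:
echo "hello" | gzip -c | gzip -dc -
```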

So if the version you are using supports that, then

    curl -s "https://foo.bar/$f" > data.zip
    unzip data.zip

can be simplified to

    curl -s "https://foo.bar/$f" | unzip -

If it does not, yet you still want to use unzip, then according to the answer from unix.stackexchange.com, prefixing the command with busybox will fix that, i.e.

    curl -s "https://foo.bar/$f" | busybox unzip -
Daweo