
I've created an ugly one-liner that works, but I would like to make it simpler and easier for others to read. It is used in a Dockerfile that builds an image to be run with Docker.

curl -s -L http://www.nxfilter.org/|grep Download|sed -e 's/<a /\n<a /g'|
sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'|
xargs -n1 curl -s -L|grep zip|sed -e 's/<a /\n<a /g'|
sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'|
grep -v dropbox|grep -v logon|grep -v cloud|grep zip

or without manual line breaks

curl -s -L http://www.nxfilter.org/|grep Download|sed -e 's/<a /\n<a /g'|sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'|xargs -n1 curl -s -L|grep zip|sed -e 's/<a /\n<a /g'|sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'|grep -v dropbox|grep -v logon|grep -v cloud|grep zip

Step 1: Visit nxfilter.org and follow redirects to reach www.nxfilter.org/p2/index.html
Step 2: Parse the homepage HTML for the URL of the Download page, www.nxfilter.org/p2/?page_id=93 (it's a blog-type site, so the page could change in the future)
Step 3: Parse the Download page HTML for the URL of nxfilter*.zip, which is currently http://nxfilter.org/download/nxfilter-3.0.5.zip
Step 4: Download it as nxfilter.zip
Step 5: The Dockerfile continues executing commands to set up the environment where NxFilter will run in the final Docker container.
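
For what it's worth, the same five steps read a bit better with the repeated sed chain factored into a shell function. This is only a sketch, untested against the live site: `extract_hrefs` is a helper name introduced here, it assumes GNU sed (the \n in the replacement is a GNU extension, fine in most Linux images), and it takes the first matching link where the original used xargs over all of them.

#!/bin/sh
# Sketch of the five steps above; extract_hrefs is a hypothetical helper.

# Pull href values out of anchor tags on stdin.
extract_hrefs() {
  sed -e 's/<a /\n<a /g' |
    sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
}

# Steps 1-2: follow redirects from the homepage and find the Download page URL.
download_page=$(curl -s -L http://www.nxfilter.org/ | grep Download | extract_hrefs | head -n1)

# Step 3: find the nxfilter .zip URL on the Download page, skipping mirror links.
zip_url=$(curl -s -L "$download_page" | grep zip | extract_hrefs |
  grep -v dropbox | grep -v logon | grep -v cloud | grep zip | head -n1)

# Step 4: download it as nxfilter.zip (step 5 continues in the Dockerfile).
curl -s -L -o nxfilter.zip "$zip_url"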

Surely there is a simpler way to get that URL for the .zip

Sources I consulted while building this:

  • Easiest way to extract the urls from an html page using sed or awk only
  • RegEx match open tags except XHTML self-contained tags
  • http://www.unix.com/unix-for-dummies-questions-and-answers/142627-cut-field-line-having-quotes-delimiter.html
  • wget or curl from stdin

  • You should stay with pipes, but I would split it into multiple lines for better readability – webdeb Nov 17 '15 at 03:16
  • I feel like people normally deal with this problem by making a symlink to the latest build and name it `nxfilter-latest.zip`, for example. But I assume you aren't the provider of this nxfilter file. – OneCricketeer Nov 17 '15 at 03:17
  • Exactly, I am repackaging someone else's application as a Docker image. My main purpose is to have my script go find the latest package every time it is called. I will be using IFTTT or Zapier to make a webhook call to Docker Hub every time the Downloads page is updated. This automates my current workflow: IFTTT watches a Page2RSS feed of the Download page on nxfilter's site and emails me when it changes, then I manually grab the .zip URL and paste it into the Dockerfile sitting in Git, which Docker Hub automatically builds when I make a commit. – cron410 Nov 17 '15 at 03:22
  • Use [this](http://www.nxfilter.org/download.php) URL instead perhaps? – Etan Reisner Nov 17 '15 at 03:45
  • have a look at XML-parsing commands like `xmlstarlet` or `xmllint`. `html` is _based on_ `xml`, but it is not strictly XML code on many websites (many unclosed tags)... AFAIR, `xmllint` has an option for supporting such HTML code. – anishsane Nov 17 '15 at 04:28 (see the sketch after these comments)
  • Looks like a much better idea, Etan. I completely missed that URL in the Google results and the link to download old versions. My new problem is that the download.php page only gives relative URLs, so I need to concatenate the base URL (http://www.nxfilter.org/) and the path to the file (download/nxfilter-3.0.5.exe), but that's for another discussion. It was a fun exercise, but parsing the download.php page is much cleaner! – cron410 Nov 18 '15 at 21:08
  • The author of the website ended up adding a simple PHP script to his site so I can parse that to get the latest version number and construct a download URL which is a much, much cleaner and simpler solution. About a year ago he removed nxfilter packages from the download.php page. – cron410 Sep 04 '20 at 15:26
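
Picking up anishsane's xmllint suggestion: a minimal sketch, assuming xmllint is installed in the image and the Download page is still at the URL from step 2. xmllint's HTML parser tolerates unclosed tags, and the XPath pulls every href on the page; parse warnings go to stderr, hence the 2>/dev/null.

# Sketch only: --html enables xmllint's forgiving HTML parser.
curl -s -L 'http://www.nxfilter.org/p2/?page_id=93' |
  xmllint --html --xpath '//a/@href' - 2>/dev/null |
  tr ' ' '\n' | sed -e 's/^href="//' -e 's/"$//' |
  grep 'nxfilter.*\.zip'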

1 Answer


Looks like the answer is to parse the download.php page for the URL with:

curl -sL nxfilter.org/download.php | grep nxfilter |
tail -n1|sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'|tr -d '[:blank:]'

It's still pretty ugly, but much shorter than my original string of commands.
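
Since download.php serves relative paths (the point raised in the comments above), the full fetch needs the base URL prepended. A sketch, untested against the live page; the extra grep zip is my addition to skip the .exe that may also be listed.

# Extract the relative path, then prepend the base URL before downloading.
base='http://www.nxfilter.org/'
path=$(curl -sL nxfilter.org/download.php | grep nxfilter | grep zip | tail -n1 |
  sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d' | tr -d '[:blank:]')
# Concatenate base + relative path and save as nxfilter.zip.
curl -sL -o nxfilter.zip "${base}${path}"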
