Find latest link from a HTML page listing download locations

Question

I'm trying to build an equivalent to the following github-specific code that works for finding the latest artifact available for download from https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master -- the download links look something like https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz.

# Working code for github.com, needs to be converted to fivem.net
LOCATION=$(curl -s https://api.github.com/repos/someuser/somerepo/releases/latest \
| grep "tag_name" \
| awk '{print "https://github.com/someuser/somerepo/archive/" substr($2, 2, length($2)-3) ".zip"}') \
; curl -L -o file.zip "$LOCATION"

The build number in the path increases over time but is not sequential, and it is followed by a commit hash that cannot be predicted.

How can I find the latest download link from the HTML page at https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master?

We know github's api, but we can't know example.com's api. The appropriate mechanism is specific to the server at hand, so a specific server (or at least a specific API) is needed to make this answerable. — Charles Duffy, Sep 25 '22 at 15:47
...unless you're given the URL as an input, and just want to extract pieces of that URL into separate variables? That can be as simple as a regex match (see `[[ $string =~ $regex ]]` and `BASH_REMATCH`) -- but please be clear if that is in fact the case. — Charles Duffy, Sep 25 '22 at 16:07
@CharlesDuffy Thank you for your reply. The link I need to put into my script will look like this (it's the latest version of the file I use in my Docker image): https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz What I'm looking for is a way to read this page: https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/ and get at least the "Latest Recommended" version, using curl or wget to fetch the file. — Mandatti, Sep 25 '22 at 16:38
Have you looked at whether they offer a non-HTML API? We absolutely _can_ do what you asked for, but the problem with parsing HTML is that because it's meant for humans and browsers, it can be changed in ways that still let humans understand it but break automated parsing; if the `artifacts/fivem/build_proot_linux/master` page can be requested with something like an `Accept: application/json` header to get a JSON format, that's going to be less likely to break when the site undergoes a redesign. — Charles Duffy, Sep 25 '22 at 16:41
Anyhow -- if I'm not free to write my own answer for a few minutes, some pointers to get started: [easiest way to extract the urls from a html page using sed or awk only](https://stackoverflow.com/questions/1881237/easiest-way-to-extract-the-urls-from-an-html-page-using-sed-or-awk-only) gives you pointers on getting the links out; once you have them you can parse them. (I strongly advise _against_ trying to use sed or awk only, but there are answers there using better, HTML-aware tools like lynx, xidel, etc). — Charles Duffy, Sep 25 '22 at 16:45
It seems that they don't have any APIs that we can use in this case. They do have a GitHub page, but unlike txAdmin (the file I'm able to get through GitHub's API) they don't release versioned binaries there, only source code, which increases the workload, size and deployment time for the container. — Mandatti, Sep 25 '22 at 16:52
Also relevant is https://forum.cfx.re/t/latest-artifacts-download/1110313, demonstrating how to calculate the download URL based on the github tag and commitish. — Charles Duffy, Sep 25 '22 at 17:24
(also, note that powershell-core _is_ available for Linux, so the above powershell solution should work for you even when not on Windows). — Charles Duffy, Sep 25 '22 at 17:34
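
For reference, the `[[ $string =~ $regex ]]` / `BASH_REMATCH` approach mentioned in the comments can be sketched as follows. This is a minimal illustration, assuming the URL layout shown in the question; `split_artifact_url` is a made-up helper name, not something the site or Bash provides.

```shell
#!/usr/bin/env bash
# split_artifact_url is a hypothetical helper for this sketch: it prints
# "<build number> <commit hash>" for an artifact URL, or returns 1 if the
# URL does not have the expected shape.
split_artifact_url() {
  local url=$1
  local re='/master/([[:digit:]]+)-([[:xdigit:]]+)/fx[.]tar[.]xz$'
  [[ $url =~ $re ]] || return 1
  printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}

url='https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz'
if read -r build commit < <(split_artifact_url "$url"); then
  echo "build=$build commit=$commit"
fi
```

The regex is kept in a variable and used unquoted on the right-hand side of `=~`, which is what makes Bash treat it as a pattern rather than a literal string.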

score 1 · Answer 1 · answered Oct 01 '22 at 21:10

If you want to parse HTML with a command-line tool, then I suggest you take a look at a proper HTML-parser like xidel:

$ xidel -s "https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/" \
  -e '//a[@class="panel-block  is-active"][1]/@href'
./5914-b600ff018d939f6a65e48994bf4a4192388435e7/fx.tar.xz

In addition, there's no need for a Bash-script or any other tool, because with --follow/-f and --download you can download the file right away:

$ xidel -s "https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/" \
  -f '//a[@class="panel-block  is-active"][1]/@href' \
  --download .

This downloads 'fx.tar.xz' in the current dir. I wouldn't recommend manually entering 'file.zip' when the extension is "xz" in the first place. You could however generate a more appropriate filename:

$ xidel "https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/" \
  -f '//a[@class="panel-block  is-active"][1]/@href' \
  --download 'artifacts-{extract($url,"master/(\d+)-",1)}.tar.xz'
Retrieving (GET): https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/
Processing: https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/
Retrieving (): https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5914-b600ff018d939f6a65e48994bf4a4192388435e7/fx.tar.xz
Save as: artifacts-5914.tar.xz

This downloads 'artifacts-5914.tar.xz' in the current dir. And when you leave out --silent/-s, you'll get to see these log-messages. Btw, I don't know this software, so I assumed it's called "artifacts".

score 0 · Accepted Answer · answered Sep 25 '22 at 17:31

We can build on `lynx -dump`, as suggested in "Easiest way to extract the urls from an html page using sed or awk only" --

#!/usr/bin/env bash

# Capture group 1 is the build number, group 2 the commit hash.
url_re='https://runtime[.]fivem[.]net/artifacts/fivem/build_proot_linux/master/([[:digit:]]+)-([[:xdigit:]]+)/fx[.]tar[.]xz'
newest_link_num=0
newest_link_content=
# lynx prints each link as "<index>. <url>"; discard the index, keep the URL.
while read -r _ link; do
  [[ $link =~ $url_re ]] || continue
  # Compare numerically so the highest build number wins.
  if (( ${BASH_REMATCH[1]} > newest_link_num )); then
    newest_link_num=${BASH_REMATCH[1]}
    newest_link_content=$link
  fi
done < <(lynx -dump -listonly -hiddenlinks=listonly https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master)

echo "Newest link is: $newest_link_content"

As of this writing, it finishes with the following output:

Newest link is: https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz
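
To go from the link to a downloaded file, one option is to wrap the selection loop in a function that reads the `lynx` listing on standard input, so it can be tried offline and then piped into `curl`. This is only a sketch under the same assumptions as the script above; `pick_newest` is a hypothetical name, and the `10#` prefix is an extra precaution so that a build number with leading zeros would not be read as octal.

```shell
#!/usr/bin/env bash
# pick_newest is a hypothetical helper: it reads "<index>. <url>" lines (the
# format of `lynx -dump -listonly`) on stdin and prints the artifact URL
# with the highest build number. It returns non-zero if nothing matched.
pick_newest() {
  local re='/master/([[:digit:]]+)-[[:xdigit:]]+/fx[.]tar[.]xz$'
  local best_num=-1 best_link='' link
  while read -r _ link; do
    [[ $link =~ $re ]] || continue
    if (( 10#${BASH_REMATCH[1]} > best_num )); then
      best_num=$(( 10#${BASH_REMATCH[1]} ))
      best_link=$link
    fi
  done
  [[ -n $best_link ]] && printf '%s\n' "$best_link"
}

# Usage against the live page (needs lynx and curl; shown as a comment only):
#   page=https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master
#   link=$(lynx -dump -listonly -hiddenlinks=listonly "$page" | pick_newest) &&
#     curl -fL -o fx.tar.xz "$link"
```

Because the comparison is arithmetic rather than lexical, a three-digit build number never outranks a four-digit one.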

It works like a charm. Thank you sir. Since I can have the link I can also move forward with my project. Kudos! — Mandatti, Sep 25 '22 at 17:55

score 0 · Answer 3 · answered Sep 26 '22 at 08:16

I examined https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/ and the latest links (version 5902, i.e. newest, and version 5484, i.e. latest recommended) seem to have the is-active class

<a class="panel-block  is-active" href="./5902-3c88d7752be75493078c1da898337b0abc2652ff/fx.tar.xz" style="display: block;">

as opposed to older versions. If possible, you should use tools designed for working with HTML, for example hxselect; however, if you are not allowed to install such tools, you might use GNU AWK instead, in the following way

wget -O - https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/ | awk 'BEGIN{RS="<|>"}/is-active/{sub(/^.*href="\./,"");sub(/".*/,"");print "https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master"$0}'

to get output

https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5902-3c88d7752be75493078c1da898337b0abc2652ff/fx.tar.xz
https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5848-4f71128ee48b07026d6d7229a60ebc5f40f2b9db/fx.tar.xz

Explanation: I tell GNU AWK that the record separator (RS) is < or >, so the inside of each start or end tag is treated as a single record. Then, for each record that contains is-active, I replace everything up to and including href=". with an empty string (i.e. delete it), then replace the " and everything after it with an empty string (i.e. delete it), and finally print the concatenation of https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master and the extracted href value.

(tested in gawk 4.2.1)
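
The record-splitting idea can be tried without touching the network by feeding the same AWK program a small HTML sample. The sketch below assumes an awk that accepts a regular expression as RS (gawk does; mawk generally does as well). `active_links` is a hypothetical wrapper name, and the second, inactive link in the sample is made up for the demonstration.

```shell
#!/usr/bin/env bash
# active_links is a hypothetical helper: it reads HTML on stdin and prints an
# absolute URL for every tag whose attributes mention "is-active".
base='https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master'
active_links() {
  awk -v base="$base" '
    BEGIN { RS = "<|>" }        # each tag body becomes one record
    /is-active/ {
      sub(/^.*href="\./, "")    # delete everything up to and including href=".
      sub(/".*/, "")            # delete the closing quote and the rest of the tag
      print base $0
    }'
}

# Offline demonstration on a two-link sample instead of the live page.
active_links <<'EOF'
<a class="panel-block  is-active" href="./5902-3c88d7752be75493078c1da898337b0abc2652ff/fx.tar.xz" style="display: block;">
<a class="panel-block " href="./5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz">
EOF
```

Only the tag carrying is-active should produce a line of output; the other anchor is skipped because its record never matches the pattern.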
