
I have created a small program consisting of a couple of shell scripts that work together. It is almost finished and everything seems to work fine, except for one thing that I'm not really sure how to do, and which I need in order to finish this project...

There seem to be many routes that could be taken, but I just can't get there...

I have some curl results with lots of unused data, including various links. Among all that data there is a bunch of similar links, and I only need to get (into a variable) the link with the highest number (without the surrounding text, which is always the same).

The links are all similar, and have this structure:
<a href="https://always/same/link/same-name_19.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_17.html">always same text</a>

I was thinking about something like this:

content="$(curl -s "$url/$param")"

# pseudocode: get from $content all the href links whose anchor text is "always same text"
linksArray="..."

highestnumber=0

for file in $linksArray
do
    # keep only the file name after the last '/'
    href=${file##*/}

    # strip the .html extension
    fullname=${href%.html}

    # split the remaining name on '_' to isolate the number
    OIFS="$IFS"
    IFS='_'
    read -a nameparts <<< "${fullname}"
    IFS="$OIFS"

    # remember the highest number seen so far
    if (( ${nameparts[1]} > highestnumber ))
    then
        highestnumber=${nameparts[1]}
    fi
done

echo "${nameparts[0]}_${highestnumber}.html"

Desired result:

https://always/same/link/unique-name_19.html

This was just my guess; any working code that can be run from a bash script is OK... thanks...

Update

I found this nice program; it is easily installed by:

# 64bit version
wget -O xidel/xidel_0.9-1_amd64.deb https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9/xidel_0.9-1_amd64.deb/download
apt-get -y install libopenssl
apt-get -y install libssl-dev
apt-get -y install libcrypto++9
dpkg -i xidel/xidel_0.9-1_amd64.deb

It looks awesome, but I'm not really sure how to tweak it to my needs.

Based on that link and the answer below, I guess a possible solution would be:

  1. use xidel, or use `sed -n 's/.*href="\([^"]*\).*/\1/p' file` as suggested in this link, but then tweak it to keep the whole link with its HTML tags, like:

    <a href="https://always/same/link/same-name_17.html">always same text</a>

  2. then filter out everything that doesn't end with `">always same text</a>`
  3. and then use the grep and sort approach as mentioned in the answer below (a rough sketch combining these steps follows this list).
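
Putting those three steps together, I imagine something like this (just a rough, untested sketch; "highestlink" is only a name I made up):

content="$(curl -s "$url/$param")"

# keep only the anchors whose text is "always same text", pull out the href,
# sort so the highest number ends up last (GNU sort -V compares the embedded
# numbers numerically), and take the last line
highestlink=$(printf '%s\n' "$content" \
  | grep -o '<a href="[^"]*">always same text</a>' \
  | sed -n 's/.*href="\([^"]*\).*/\1/p' \
  | sort -V \
  | tail -n1)

echo "$highestlink"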
Ricky
  • Why not something simple like `thelatest=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1)` ? You can adjust the specificity of the `grep` regular expression as needed. – David C. Rankin Apr 21 '16 at 01:01

2 Answers


Continuing from the comment, you can use grep, sort and tail to isolate the highest-numbered link in your list of similar links without too much trouble. For example, if your list of links is as you have described (I've saved them in a file dat/links.txt for the purpose of the example), you can easily isolate the highest number in a variable:

Example List

$ cat dat/links.txt
<a href="https://always/same/link/same-name_19.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_17.html">always same text</a>

Parsing the Highest Numbered Link

$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort | tail -n1); \
echo "myvar : '$myvar'"
myvar : 'https://always/same/link/same-name_19.html'

(note: the command above is a single command, split across two lines with the line continuation '\')

Applying Directly to Results of curl

Whether your list is in a file, or returned by curl -s, you can apply the same approach to isolate the highest-numbered link in the returned list. You can use process substitution with the curl command alone, or you can pipe the results to grep. E.g., as noted in my original comment,

$ myvar=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1); \
echo "myvar : '$myvar'"

or pipe the result of curl to grep,

$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"

(same line continuation note.)

David C. Rankin
  • Interesting post. When I try your example everything works fine, but when combining it with curl I just can't get it to work... The only thing that makes sense (to me) is that grep is a line-by-line tool, that the curled HTML page has many lines, and that grep then finds the correct line, finds the first instance of "https:" and the last instance of ".html", and displays both with everything in between. – Ricky Apr 24 '16 at 03:56
  • What shell are you using? The process substitution (e.g. `< <(curl -s "$url/$param")`) will not work in POSIX shell, but the pipes version should. Also, variations in what is returned by any specific URL may require tweaks to the regular expression (especially if there is more than one `.html` in a line). – David C. Rankin Apr 24 '16 at 04:21
  • I'm not sure if it answers your question, but I'm using Debian Jessie with the default shell. I tried both, but coincidentally mostly experimented with the piped version. P.S. I updated my question. – Ricky Apr 24 '16 at 11:23
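
For example (an illustrative tweak based on the comment above): if a single line of the page contains more than one link, excluding the closing quote from the match keeps each URL separate instead of letting the greedy `.*` span the whole line:

$ myvar=$(curl -s "$url/$param" | grep -o 'https:[^"]*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"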

Why not use Xidel with xquery to sort the links and return the last?

xidel -q links.txt --xquery "(for $i in //@href order by $i return $i)[last()]" --input-format xml

The --input-format parameter makes sure you don't need any HTML tags at the start and end of your txt file.

If I'm not mistaken, in the latest Xidel the -q (quiet) param is replaced by -s (silent).

MatrixView
  • When placing the 3 links given in my question into links.txt and running the given line, I'm getting an Error: err:XPST0003: Unknown or unexpected operator: in in: (for [<- error occurs before here] in //@href order by return )[last()] – Ricky Apr 26 '16 at 19:26
  • If you use Linux, swap double quotes for single. – MatrixView Apr 27 '16 at 14:15
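
With that change, the call on Linux would presumably look like this (single quotes so the shell doesn't expand $i, and -s instead of -q on recent Xidel versions, as noted above):

xidel -s links.txt --xquery '(for $i in //@href order by $i return $i)[last()]' --input-format xml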