Shell script - remove all before and after

Question

Find the next link if the Link header contains rel=next. Fetching the Link header can return different strings, and I need to extract the next link from whichever one comes back, e.g.

Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;

would be http://mygithub.com/api/v3/organizations/20/repos?page=3

Link: <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="next", <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="last"

would be http://mygithub.com/api/v3/organizations/4/repos?page=2

Played with sed and parameter expansion - not that experienced so got stuck :)

"Shell" meaning you need to be compatible with `/bin/sh`, or is this running in bash, ksh, zsh, or another extended shell? If you're in a shell with native regex support, you should consider using that. — Charles Duffy, Oct 30 '20 at 17:48
See the answers using `BASH_REMATCH` in [extract substring using regexp in plain bash](https://stackoverflow.com/questions/13373249/extract-substring-using-regexp-in-plain-bash/13373256). Using `sed` is generally best avoided when you're running it with only one line of input per invocation -- it takes a lot of time to start up each copy, even though it's quite fast once it's running. — Charles Duffy, Oct 30 '20 at 17:49
@shellter thanks. One question: how can I assign the value to a variable in the shell script? E.g. I have the string with the links in a variable named nextReposLink, and `echo $nextReposLink` prints the string with the mygithub links. I want to save the result of this command in a new variable: `$nextReposLink | awk '{for (i=0; i<=NF; i++){if ($i == "rel=next,"){print $(i-1);exit}}}' | sed -e 's/ /' -e 's/>;/ /'` I tried something like the following, but it gives me a "bad substitution": `x="${echo $nextReposLink | awk '{for (i=0; i<=NF; i++){if ($i == \"rel=next,\"){print $(i-1);exit}}}'}"` — klind, Nov 02 '20 at 23:58

score 0 · Answer 1 · answered Oct 30 '20 at 17:43

Well - I put one of your URL strings in a text file and was able to pull out the first URL with two cuts.

[root@oelinux2 ~]# cat test
Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;

Then, using cut:

cat test | cut -d "<" -f2 | cut -d ">" -f1


[root@oelinux2 ~]# cat test | cut -d "<" -f2 | cut -d ">" -f1
http://mygithub.com/api/v3/organizations/20/repos?page=1

That's one option - if you are just looking to get the first URL in the string. Basically - that's just grabbing what's between the two delimiters "<" and ">"

With cut, -d sets the delimiter and -f selects the field you want to get.

If you wanted to get a later URL in that string, you could change the fields (-f #) and see what you get :)
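As a small sketch of what changing the field number does, the snippet below runs the same two-cut pipeline on the first sample header from the question, once with -f2 and once with -f3. The variable names are made up for illustration, and the approach still depends on the URL sitting at a fixed position.

```shell
# Sample header copied from the question.
header='Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;'

# Field 1 is everything before the first "<", so field 2 holds the first URL
# and field 3 holds the second one.
first_url=$(printf '%s\n' "$header" | cut -d '<' -f2 | cut -d '>' -f1)
second_url=$(printf '%s\n' "$header" | cut -d '<' -f3 | cut -d '>' -f1)

printf '%s\n' "$first_url" "$second_url"
```

For this particular header the second URL happens to be the rel=next one, but that is a coincidence of ordering.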

The next link will not always be in the same spot. As you can see, sometimes the prev comes first. It's like I have to find the string 'rel="next"' and then go backwards from there, finding the first > and then the <, and take what is between. — klind, Oct 30 '20 at 18:02
Oh ya.. see that - perhaps Charles Duffy in the reply to your OP using Regex might be best there.. Because cut and awk are pretty much dependent on using a positional field. I'm sure you could accomplish it with the right regex statement - but I am no real regex pro.. — Overcast, Oct 30 '20 at 18:59
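The backwards search klind describes can also be sketched with plain POSIX parameter expansion instead of a regex. This is only an illustration under a few assumptions: the function name next_link is hypothetical, the header is assumed to use exactly `>; rel=next` or `>; rel="next"` (one space, as in the two samples), and nothing here is a general Link header parser.

```shell
# Print the URL whose relation is "next", or return 1 if there is none.
next_link() {
    h=$1
    case $h in
        # Cut everything from the ">" that closes the next URL onwards.
        *'>; rel="next"'*) h=${h%%'>; rel="next"'*} ;;
        *'>; rel=next'*)   h=${h%%'>; rel=next'*} ;;
        *) return 1 ;;
    esac
    # What remains ends with the URL; drop everything up to the last "<".
    printf '%s\n' "${h##*<}"
}

# The two sample headers from the question.
header1='Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;'
header2='Link: <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="next", <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="last"'

next_link "$header1"
next_link "$header2"
```

Because it uses only a case statement and parameter expansion, no external process is started per header, which matches the concern about launching sed for a single line of input.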

score 0 · Accepted Answer · answered Nov 03 '20 at 03:43

Please be aware that parsing HTML with non-HTML tools is fraught with peril; you will see that this works, and assume you can get away with it always. You'll spend hours trying to get the next level of complexity to work, when you should be studying how to use HTML-aware tools. Don't say we didn't warn you (-;, but

printf "<http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;\n" \
| awk -F" " '{
    for(i=1;i<=NF;i++){
       if ($i == "rel=next,") {
         # strip the angle brackets and the trailing semicolon from the preceding field
         gsub(/[<>]/,"",$(i-1)); sub(/;$/,"",$(i-1))
         print $(i-1)
       }
    }
}'

produces required output:

http://mygithub.com/api/v3/organizations/20/repos?page=3

To save the output of a script section into a variable, wrap the code in a command substitution, in this case

 nextReposLink=$( printf .... | awk '....' )
 #-------------^^--------------------------^

The items marked with ^ are the modern syntax for command substitution. The code inside $( ... ) is executed and its standard output is passed as an argument to the invoking command line. (The original syntax for command substitution is/was `cmds` and works the same in the simple case var=`cmds`.) You can nest modern command substitution easily, whereas the old version requires a lot of escape-character fiddling. Avoid it if you can.
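Here is a minimal sketch of capturing a pipeline's output with $( ... ), including one nested substitution. The variable names linkHeader, nextLink and summary, and the shortened sample value, are made up for illustration. Note that the "bad substitution" error from the comment on the question comes from writing ${ ... }, which is parameter expansion, where $( ... ) was intended.

```shell
# Hypothetical shortened sample standing in for the real Link header.
linkHeader='<http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next,'

# $( ... ) runs the pipeline and captures its standard output in the variable.
nextLink=$(printf '%s\n' "$linkHeader" | awk '{ gsub(/[<>;]/, "", $1); print $1 }')

# Nested substitution: the inner $( ... ) needs no extra escaping.
summary=$(printf 'host=%s\n' "$(printf '%s\n' "$nextLink" | cut -d/ -f3)")

printf '%s\n' "$nextLink" "$summary"
```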

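Note that just about any s/str/rep/ that sed can do, awk can do too, but it requires the sub(/regex/, "repl", target) or gsub(sameArgs) functions, where target is the variable or field to modify (it defaults to $0). In this particular case the < and > need no escaping inside the bracket expression [<>]; in GNU awk, \< and \> are word-boundary operators, so escaping them would change the meaning.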
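To make the sed/awk correspondence concrete, here is a small side-by-side sketch that cleans one field from the sample header both ways. The variable names are hypothetical, and it assumes the field looks exactly like the one in the question.

```shell
# One field from the sample header, as awk would see it in $(i-1).
field='<http://mygithub.com/api/v3/organizations/20/repos?page=3>;'

# sed: s///g removes every < and >, then a second s/// drops the trailing ";".
with_sed=$(printf '%s\n' "$field" | sed -e 's/[<>]//g' -e 's/;$//')

# awk: gsub() replaces every match and sub() only the first one;
# with no third argument both operate on the whole record $0.
with_awk=$(printf '%s\n' "$field" | awk '{ gsub(/[<>]/, ""); sub(/;$/, ""); print }')

printf '%s\n' "$with_sed" "$with_awk"
```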

Be sure to always double-quote variables when you use them, i.e. echo "$nextReposLink".

IHTH
