
I'm wondering how I can extract the target (href) of a hyperlink from HTML.

For instance:

<article id="post36">
                <div>
                    <h3><a href="/blog/2019/4-14-canaries-in-the-coal-mine.html">Canaries in the Coal Mine</a></h3>
                    <p class="author">Posted by <a href="/blog/authors/moderator.html" rel="author">Moderator</a></p>
                    <p><time><span>Sunday, April 14th, 2019</span> &mdash; 8:17AM</time></p>
                </div>

Other posts look like this (no external page):

<article id="post33">
                <div>
                    <h3><a href="#post33">Landlines Win Again</a></h3>
                    <p class="author">Posted by <a href="/blog/authors/moderator.html" rel="author">Moderator</a></p>
                    <p><time><span>Friday, December 21st, 2018</span> &mdash; 7:14AM</time></p>

In an external script, I am passed the ID of a particular post; in this case it is post 36, shown above. I have a page containing all the post metadata in article tags like the ones above.

I tried catting the webpage (I have a local copy) and piping it to sed -n 's|[^<]*<article\([^<]*\)</article>[^<]*|\1\n|gp'

That sort of works, but it only returns the opening article tags, like this:

<article id="post6">
<article id="post5">
<article id="post4">
<article id="post3">
<article id="post2">
<article id="post1">

My conclusion is that it only works on the current line. And when I try actually using the ID I get nothing: sed -n 's|[^<]*<article id="post36">\([^<]*\)</article>[^<]*|\1\n|gp'

My question here is how can I take advantage of the built-in Unix tools (sed, grep, awk, etc.) to extract the hyperlink? In this case, what I need is /blog/2019/4-14-canaries-in-the-coal-mine.html

Yes, I have consulted a number of SO posts like this one and this one, most of which discourage this kind of thing (I tried the native solutions but none worked). Two things:

  1. The HTML is formatted nicely. There will never be any extra white spaces, carriage returns, or anything else in the code. The blocks will always look like that. This is a very specific application.
  2. Unless it is actually impossible to do this without some kind of add on or external program, I'd like to stick with basic Unix tools.
InterLinked

1 Answer


You can single out the line of interest with a sed address. In this case, the address is a regexp that matches the <a href inside the h3:

sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p' test.html 
/blog/2019/4-14-canaries-in-the-coal-mine.html
#post33
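
To see the two parts separately: the address decides which lines sed acts on, and the substitution replaces the whole line with the captured href. Here is a small sketch on two inline sample lines. The `extract_href` name is made up for illustration, the address is simplified to `/h3.*href/`, and `-E` is used as the more portable spelling of GNU sed's `-r`.

```shell
# extract_href: read HTML on stdin, print the href of each <h3> link.
# (Hypothetical helper name, not part of the original command.)
extract_href() {
    # Address /h3.*href/ : act only on lines that hold an <h3> link.
    # s/.../\1/p         : replace the line with the captured href, then print it.
    sed -nE '/h3.*href/ s/.*<a href="([^"]+)".*/\1/p'
}

# The author line has a link but no <h3>, so the address skips it.
printf '%s\n' \
    '<h3><a href="#post33">Landlines Win Again</a></h3>' \
    '<p class="author">Posted by <a href="/blog/authors/moderator.html" rel="author">Moderator</a></p>' |
    extract_href
# prints: #post33
```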

To match by article ID, put a grep in front of the sed command:

grep -A3 'article id="post36"' test.html | sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p'
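
When the post number comes in as a script parameter, the grep pattern has to be double-quoted so the shell expands the variable. Here is a minimal sketch, assuming a hypothetical function name `post_link` and assuming the h3 line always sits within three lines of the article tag, as in the samples:

```shell
#!/bin/bash
# post_link ID FILE: print the href of the <h3> link for one post.
# The function name and argument order are illustrative assumptions.
post_link() {
    local id=$1 file=$2
    # Double quotes let the shell expand $id; the inner quotes are escaped.
    # The closing quote in the pattern keeps post3 from also matching post36.
    grep -A3 "article id=\"post${id}\"" "$file" |
        sed -nE '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p'
}
```

It could then be called as `link=$(post_link 36 test.html)`, with `"$1"` in place of the literal number inside a script.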
LMC
  • Hey, that helps a lot! But now I need to narrow it down by blog ID, that returns all the hyperlinks for every blog. I don't see "article" in your sed anywhere. Furthermore, some blog posts don't have a separate page, and the hyperlink is of format "#ref", and I don't see these showing up, only the ones that have a separate page. Is it possible to fix that? – InterLinked Jul 11 '19 at 20:11
  • Yeah, if you provide the proper examples. – LMC Jul 11 '19 at 20:12
  • OK, updated with the other kind of post. I would basically call this in a shell script with the blog ID # – InterLinked Jul 11 '19 at 20:14
  • You could, I will leave that exercise to you ;-) – LMC Jul 11 '19 at 20:15
  • No, I know how to do that, I was just saying that's how I call it – InterLinked Jul 11 '19 at 20:16
  • I think I made a simple syntax error. I know I'm grepping the file then piping to sed, but why does grep still think I need a file? `grep -A3 /folder/path.html 'article id="post36"' | sed -nre '/h3.*href.*\/blog\// s/.* – InterLinked Jul 11 '19 at 20:16
  • It is not returning anything. My full line in my script is link=`grep -A3 'article id="post$1"' /home/com/interlinked/blog.html | sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.* – InterLinked Jul 11 '19 at 20:37
  • Try the `grep` only on command line and make sure $1 is correct on your script, should be a number only. – LMC Jul 11 '19 at 20:41
  • Works perfectly from the command line, but I'm wondering single and double quoting how to get the $1 working right...? – InterLinked Jul 11 '19 at 20:44
  • Hmm... link=`grep -A3 "article id=\"post$1\"" /home/com/interlinked/blog.html | sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.* – InterLinked Jul 12 '19 at 02:13
  • `grep -A3 'article id="post'"$1"'"'` or `grep -A3 'article id="post'$1'"'` or `patt=$(printf 'article id="post%d"' $1); grep -A3 "$patt" `. BTW, remember the shebang at the beginning of the script `#!/bin/bash`. – LMC Jul 12 '19 at 02:18
  • I've got the shebang, let me try those out – InterLinked Jul 12 '19 at 02:26
  • The last one works the best. Now it is returning a URL for each one. Not necessarily the right URL. It is returning ones that start with # but I can't figure out what's gone wrong... I'll take a closer look and see what the problem is in the morning. Thanks! – InterLinked Jul 12 '19 at 02:30