Scraping specific hyperlinks from a website using bash

Question

I have a website containing several dozen hyperlinks in the following format:

<a href=/news/detail/1/hyperlink>textvalue</a>

I want to get all hyperlinks, and their text values, where the hyperlink begins with /news/detail/1/.

The output should be in the following format:

textvalue
/news/detail/1/hyperlink

https://stackoverflow.com/questions/25358698/parse-html-using-shell — Maroun, Oct 12 '19 at 13:49
The title of your question doesn't make much sense. It's like asking "what sunglasses should I get using a rowboat?" Obviously, no matter how you cut it (and I'd take Maroun's advice *very seriously*), you're going to be using some other program. Whether you run that program from bash or csh or zsh or whatever other shell there may be -- that is neither here nor there. — , Oct 12 '19 at 17:26

Z4-tier · Accepted Answer · 2019-10-12T15:46:44.053

First of all, people are going to come in here (possibly talking about someone named Cthulhu) and tell you that awk/regex are not HTML parsers. They are right, and you should give some thought to what they say. Realistically, though, you can very often get away with something like this:

sed -n 's/^.*<a\s\+href\=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html

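This tells sed to read the file input_file.html, find lines that match the regex, rewrite each matching line as the link text followed by the href value, and discard everything else (-n suppresses the default output, and the p flag prints only lines where the substitution succeeded). Because -i is not used, the file itself is left unchanged; the result is only printed to the terminal. Note that \s, \+, and \n in the replacement are GNU sed extensions.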

This also assumes that the file is formatted such that each instance of <a href=/news/detail/1/hyperlink>textvalue</a> is on a separate line. The regex could easily be modified to accommodate different formatting, if needed.
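As one example of accommodating different formatting: because the leading ^.* is greedy, the command above reports only the last link on a line that contains several. A minimal sketch of one workaround, assuming grep supports -o (GNU and BSD grep both do), is to split every anchor onto its own line first and then reformat it. The sample markup and the name extract_links are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample: two anchors on the same line.
html='<p><a href=/news/detail/1/first>First story</a> and <a href=/news/detail/1/second>Second story</a></p>'

# extract_links is an illustrative helper name, not part of the answer above.
extract_links() {
  # grep -o prints each match on its own line, so several anchors
  # on one input line are separated before sed sees them.
  grep -o '<a[[:space:]]\{1,\}href=[^>]*>[^<]*</a>' |
    # "|" is used as the s-command delimiter so "/" needs no escaping.
    # The backslash-newline in the replacement is the portable form of \n.
    sed 's|^<a[[:space:]]\{1,\}href=\([^>]*\)>\([^<]*\)</a>$|\2\
\1|'
}

result=$(printf '%s\n' "$html" | extract_links)
printf '%s\n' "$result"
```

This still assumes unquoted href values and anchors with no nested tags or other attributes, just like the original one-liner.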

If all of the links you want happen to start with /news/detail/1/, this will probably work:

sed -n 's/^.*<a\s\+href\=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
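The escaped slashes make that pattern hard to read. As a sketch of the same idea, sed accepts a different delimiter for the s command, so the prefix can be passed in unescaped. The function name links_with_prefix and the sample markup are hypothetical, and the prefix is assumed to contain no "|" and no regex metacharacters.

```shell
#!/bin/sh
# links_with_prefix is a hypothetical helper name.
# Usage: links_with_prefix PREFIX < file.html
# Assumption: PREFIX contains no "|" and no regex metacharacters.
links_with_prefix() {
  # "|" as the delimiter means the "/" characters in PREFIX need no escaping.
  # [[:space:]]\{1,\} and the backslash-newline are portable spellings
  # of GNU sed's \s\+ and \n.
  sed -n 's|^.*<a[[:space:]]\{1,\}href=\('"$1"'[^>]*\)>\([^<]*\)</a>.*$|\2\
\1|p'
}

# Hypothetical sample: only the first and third links carry the wanted prefix.
sample='<li><a href=/news/detail/1/alpha>Alpha</a></li>
<li><a href=/news/detail/2/beta>Beta</a></li>
<li><a href=/news/detail/1/gamma>Gamma</a></li>'

out=$(printf '%s\n' "$sample" | links_with_prefix /news/detail/1/)
printf '%s\n' "$out"
```

Since the function reads standard input, the page could be piped in straight from a download tool such as curl -s instead of being saved to a file first.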

I don't want to replace the contents in the file. I just want to "find" them, and display them in the format I've described in the question, in the standard output on the terminal. — vefaxa, Oct 12 '19 at 15:10
Thanks for the link. I'll read up. Although, the solution you've posted, is almooost the one I want. It's just that it processes all tags. If it could work only on the tags where the `href` attribute starts with the value `/news/detail/1/`, it would be perfect! — vefaxa, Oct 12 '19 at 15:42
