
I have a website containing several dozen hyperlinks in the following format:

<a href=/news/detail/1/hyperlink>textvalue</a>

I want to get all hyperlinks, and their text values, where the hyperlink begins with /news/detail/1/.

The output should be in the following format:

textvalue
/news/detail/1/hyperlink
vefaxa
  • https://stackoverflow.com/questions/25358698/parse-html-using-shell – Maroun Oct 12 '19 at 13:49
  • The title of your question doesn't make much sense. It's like asking "what sunglasses should I get using a rowboat?" Obviously, no matter how you cut it (and I'd take Maroun's advice *very seriously*), you're going to be using some other program. Whether you run that program from bash or csh or zsh or whatever other shell there may be -- that is neither here nor there. –  Oct 12 '19 at 17:26

1 Answer


First of all, people are going to come in here (possibly talking about someone named Cthulhu) and tell you that sed, awk and regexes are not HTML parsers. They are right, and you should give some thought to what they say. Realistically, though, you can very often get away with something like this:

sed -n 's/^.*<a\s\+href=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html

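This tells sed to read the file input_file.html and, for each line that matches the regex, replace the whole line with the link text (capture group \2), a newline, and the href value (capture group \1). The -n option together with the trailing p flag discards every line that does not match. The result is printed to the terminal. Note that \s, \+ and \n in the replacement are GNU sed extensions, so this may not work with other sed implementations.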
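To make the behaviour described above concrete, here is a small sketch that runs the same substitution against a made-up sample file. The file contents and the variable names sample and result are assumptions for illustration only, and GNU sed is assumed.

```shell
# Sample input; the contents are made up for illustration.
sample=$(mktemp)
printf '%s\n' \
    '<ul>' \
    '<li><a href=/news/detail/1/first-story>First story</a></li>' \
    '<li>No link on this line</li>' \
    '<li><a href=/about>About</a></li>' \
    '</ul>' > "$sample"

# -n  suppresses sed's default printing of every input line
# \1  is the href value, \2 is the link text
# p   prints only the lines on which the substitution succeeded
result=$(sed -n 's/^.*<a\s\+href=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Lines without a link are dropped, and each remaining line is turned into a text/href pair. The /about link is printed too, because this version of the regex does not filter on the href prefix.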

This also assumes that the file is formatted such that each instance of <a href=/news/detail/1/hyperlink>textvalue</a> is on a separate line. The leading .* is greedy, so if several links share a line, only the last one on that line is printed. The regex could easily be modified to accommodate different formatting, if needed.
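One possible way to relax the one-link-per-line assumption is to let grep -o split the matches onto separate lines first and then reformat each one with sed. This is only a sketch: extract_links is a hypothetical helper name, GNU grep and GNU sed are assumed, and the href is still assumed to be unquoted and to be the only attribute of the tag.

```shell
# Hypothetical helper: print "text" then "href" for every matching anchor,
# even when several anchors share one input line.
extract_links() {
    # grep -o emits each matched <a ...>...</a> on its own line
    grep -o '<a[[:space:]]\+href=/news/detail/1/[^>]*>[^<]*</a>' "$1" |
        # each line now holds exactly one anchor, so no leading/trailing .* is needed
        sed 's/^<a[[:space:]]\+href=\([^>]*\)>\([^<]*\)<\/a>$/\2\n\1/'
}
```

It would be called as extract_links input_file.html. An anchor whose text spans several lines is still missed, which is one more reason to reach for a real HTML parser when the markup is not under your control.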

To keep only the links whose href begins with /news/detail/1/, put that prefix inside the first capture group. This will probably work:

sed -n 's|^.*<a\s\+href=\(/news/detail/1/[^>]\+\)>\([^<]\+\)</a>.*$|\2\n\1|p' input_file.html
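Real pages often quote attribute values and add further attributes, which the command above does not allow for. The following variant is a sketch that tolerates an optional double quote around the href and ignores anything else before the closing >. The function name extract_news_links is hypothetical, GNU sed is assumed (for \s, \+, \? and \n), and it still expects one link per line.

```shell
# Hypothetical helper: like the command above, but accepts href=/x and href="/x",
# and skips any extra attributes that follow the href.
extract_news_links() {
    # "\?          optional opening/closing double quote
    # [^"> ]*      href value stops at a quote, a space or the end of the tag
    # [^>]*        swallows remaining attributes such as class="..."
    sed -n 's|^.*<a\s\+href="\?\(/news/detail/1/[^"> ]*\)"\?[^>]*>\([^<]\+\)</a>.*$|\2\n\1|p' "$1"
}
```

Usage would be extract_news_links input_file.html. Single-quoted attribute values and href attributes that are not the first attribute of the tag are not handled here.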
Z4-tier