First of all, people are going to come in here (possibly talking about someone named Cthuhlu) and tell you that awk/regex are not HTML parsers. And they are right, and you should give some thought to what they say. Realistically, you can very often get away with something like this:
sed -n 's/^.*<a\s\+href\=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
This tells sed
to read the file input_file.html
, find lines that match the regex, replace them with the sections you specified for the output, and discard everything else. The result will print to the terminal.
This also assumes that the file is formatted such that each instance of <a href=/news/detail/1/hyperlink>textvalue</a>
is on a separate line. The regex could easily be modified to accommodate different formatting, if needed.
If all of the links you want happen to start with /news/detail/1/
, this will probably work:
sed -n 's/^.*<a\s\+href\=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html