find links with regex

Question

I am learning Linux commands and regular expressions, and I am stuck on a small problem: I am trying to find a series of links within a file using sed and a regular expression. Can anyone help me work out where I am going wrong? The links look something like this:

<a href="../a-lot-of-different/words-that/should-link.html">Useful links</a>
<a href="..//a-lot-of-different/words-that/should-find-lots-of-links.html">Multiple links</a>
<a href="../another-word-and-links/multiple-words/sjshfi-dfg.html">more links</a>

This is what I have.

sed -n '/<a*href=”^[../"]*\([a-z]*\)^[.html](["]*\)/p' /file > newfile

If it's an HTML file, I recommend using a DOM parser. See http://unix.stackexchange.com/questions/6389/parse-html-on-linux and http://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash — Phil, Oct 29 '14 at 23:31

score 0 · Accepted Answer · answered Oct 29 '14 at 23:44

Regular expressions are less than ideal for parsing HTML.

You didn't show your desired output. I am guessing that you want to extract the links. If so, try:

$ sed -rn 's/.*<a\s+href="([^"]*)".*/\1/p' file
../a-lot-of-different/words-that/should-link.html
..//a-lot-of-different/words-that/should-find-lots-of-links.html
../another-word-and-links/multiple-words/sjshfi-dfg.html

How it works:

.*<a\s+href="

This matches everything before the link.

([^"]*)

This matches the link and captures it into group \1.

".*

This matches the double quote after the link and everything that follows. Because the whole line is replaced by \1, only the link is printed.
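One caveat: the leading .* is greedy, so if a line holds several anchors, the sed command above reports only the last one. A minimal sketch of one way around that, assuming a grep that supports -o (GNU and BSD grep do) and a sed that supports -E; the function name extract_hrefs and the sample input are made up for illustration:

```shell
# Hypothetical helper: print every href value found on stdin, one per line.
extract_hrefs() {
  # grep -o emits each match on its own line, so several anchors on one
  # input line are all reported; sed then strips the surrounding markup.
  grep -oE '<a[[:space:]]+href="[^"]*"' |
    sed -E 's/^<a[[:space:]]+href="//; s/"$//'
}

# Sample input (an assumption, not from the question): two links on one line.
printf '%s\n' '<a href="../a/one.html">One</a> <a href="../b/two.html">Two</a>' |
  extract_hrefs
# prints:
# ../a/one.html
# ../b/two.html
```

To run it against a real file, something like extract_hrefs < file should work. This is still pattern matching rather than HTML parsing, so attributes placed before href or single-quoted values will be missed.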

Thank you for this, it has made it a lot clearer and it has found one of the links that I am looking for. — knowlage, Oct 30 '14 at 00:37

score 0 · Answer 2 · answered Oct 29 '14 at 23:52

Since anchor tags contain an href attribute, searching for href will solve the problem:

sed -n '/href=".*"/p' link_file.txt
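Note that this prints each matching line in full rather than just the link. A small sketch to illustrate, using made-up sample input and a hypothetical wrapper named link_lines around the same sed expression:

```shell
# Hypothetical helper: print only the lines of stdin that contain href="...".
link_lines() {
  sed -n '/href=".*"/p'
}

# Sample input (an assumption): one line without a link, one with.
printf '%s\n' '<p>No link here</p>' '<a href="../a/one.html">One</a> trailing text' |
  link_lines
# prints:
# <a href="../a/one.html">One</a> trailing text
```

So this form is handy for filtering lines, but the URL would still need to be cut out of each line afterwards.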

Hackaholic

