I am currently learning Linux commands and regular expressions, and I am stuck on a small problem: I am trying to find a series of links within a file using sed and a regular expression. Can anyone help me work out where I am going wrong? The links look something like this:

<a href="../a-lot-of-different/words-that/should-link.html">Useful links</a>
<a href="..//a-lot-of-different/words-that/should-find-lots-of-links.html">Multiple links</a>
<a href="../another-word-and-links/multiple-words/sjshfi-dfg.html">more links</a>

This is what I have.

sed -n '/<a*href=”^[../"]*\([a-z]*\)^[.html](["]*\)/p' /file > newfile
Phil
knowlage
  • If it's an HTML file, I recommend using a DOM parser. See http://unix.stackexchange.com/questions/6389/parse-html-on-linux and http://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash – Phil Oct 29 '14 at 23:31

2 Answers

Regular expressions are less than ideal for parsing HTML.

You didn't show your desired output. I am guessing that you want to extract the links. If so, try:

$ sed -rn 's/.*<a\s+href="([^"]*)".*/\1/p' file
../a-lot-of-different/words-that/should-link.html
..//a-lot-of-different/words-that/should-find-lots-of-links.html
../another-word-and-links/multiple-words/sjshfi-dfg.html

How it works:

  • .*<a\s+href="

    This matches everything before the link, up to and including the opening double-quote of the href value.

  • ([^"]*)

    This matches the link and captures it into group \1.

  • ".*

    This matches the double-quote after the link and everything that follows.
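
One thing to be aware of: because the leading .* is greedy, the command above prints only the last link on a line that contains several. The following is a minimal sketch of one way around that, not a definitive solution. It assumes a grep that supports -o and a sed that supports -E (both GNU and BSD versions do); the function name extract_links and the sample file contents are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every href value in a file, one per line.
extract_links() {
    # grep -o prints each match on its own line, so several links
    # on the same input line are all reported.
    # sed then strips the href="..." wrapper and keeps the value.
    grep -oE 'href="[^"]*"' "$1" | sed -E 's/^href="([^"]*)"$/\1/'
}

# Sample input (assumed for this sketch): the second line holds two links.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="../a-lot-of-different/words-that/should-link.html">Useful links</a>
<a href="../one.html">one</a> <a href="../two.html">two</a>
EOF

extract_links "$sample"
rm -f "$sample"
```

This still treats HTML as plain text, so it shares the usual limits of regex-based extraction: it would also pick up href attributes on tags other than <a>, and it misses attributes written with single quotes.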

John1024
  • Thank you for this, it has made it a lot clearer and it has found one of the links that I am looking for. – knowlage Oct 30 '14 at 00:37

Anchor tags carry the link in their href attribute, so searching for href will find the lines that contain links:

sed -n '/href=".*"/p' link_file.txt
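
Note that an address pattern with p prints the whole matching line, tags included. If only the link itself is wanted, the address can be combined with a substitution. The sketch below contrasts the two; the sample file and the variable names whole_lines and links_only are assumptions made for illustration, and it handles only one link per line.

```shell
#!/bin/sh
# Sample input (assumed for this sketch): one line without a link, one with.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p>no link on this line</p>
<a href="../another-word-and-links/multiple-words/sjshfi-dfg.html">more links</a>
EOF

# Address only: prints every line containing href="...", unchanged.
whole_lines=$(sed -n '/href="[^"]*"/p' "$sample")

# Address plus substitution: on lines containing href=", replace the
# whole line with the captured attribute value and print the result.
links_only=$(sed -n '/href="/s/.*href="\([^"]*\)".*/\1/p' "$sample")

printf 'whole line: %s\n' "$whole_lines"
printf 'link only:  %s\n' "$links_only"
rm -f "$sample"
```

The second command uses basic regular expressions only, so it should not need the -r or -E option.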
Hackaholic