0

I have seen similar questions on SO, but none of them solved my problem.

I have a local HTML page and I want to extract its links. I don't want just the URLs, though; I want the whole tag that creates each link, like this:

<a href="page1.html">My Page 1</a>
<a href="page2.html">My Page 2</a>
<a href="page3.html">My Page 3</a>

Output like this would also be fine, if it is easier:

My Page 1
page1.html
My Page 2
page2.html
My Page 3
page3.html

I tried this command, taken from an answer to another question on SO:

grep "<a href=" t2.html |
sed "s/<a href/\\n<a href/g" |
sed 's/\"/\"><\/a>\n/2' |
grep href

but for some reason it extracts only a couple of the links on the page.

If you want to see it, this is the page I am trying to extract the links from.

thanks

Duck

2 Answers

2
cat indexantigo.html | grep -oiE "<a([^>]+)>([^<]+)</a>"

This matches every <a> element that sits on a single line and has no other tags inside it.

Details

<a([^>]+)>: starts with <a, ends with >, and contains no > in between.

([^<]+): the link text, which must contain no <.

</a>: ends with </a>.

Note that it will not match <a> tags that contain another tag, such as <a href="#"><img src="1.jpg" /></a>
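
To see which links the pattern keeps and which it drops, here is a minimal Python sketch that applies the same regex one line at a time, the way grep does. The helper name grep_o and the sample HTML are made up for illustration.

```python
import re

# Same pattern as the grep command above; IGNORECASE stands in for -i.
LINK_RE = re.compile(r"<a([^>]+)>([^<]+)</a>", re.IGNORECASE)


def grep_o(text):
    """Roughly mimic `grep -o`: scan each line on its own and collect every match."""
    matches = []
    for line in text.splitlines():
        matches.extend(m.group(0) for m in LINK_RE.finditer(line))
    return matches


if __name__ == "__main__":
    # Hypothetical sample: one simple link, one link whose text spans
    # two lines, and one link that wraps another tag.
    sample = (
        '<a href="page1.html">My Page 1</a>\n'
        '<a href="page2.html">My\n'
        'Page 2</a>\n'
        '<a href="#"><img src="1.jpg" /></a>\n'
    )
    for match in grep_o(sample):
        print(match)
```

Only the first link should come out. The second is split across two lines, and the third has a tag where the pattern expects plain text.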

Edit: I agree with Anthony Geoghegan's answer that it would be more convenient to use a scripting language such as Python.

Benoît Zu
  • 1
    That's a good regex for matching links that are contained in one line. Unfortunately, the HTML source has many links that include line breaks so this only returns a subset of links in the page. – Anthony Geoghegan Sep 12 '17 at 09:50
  • 1
    Yes, absolutely. I edited my answer to make this detail explicit. – Benoît Zu Sep 12 '17 at 09:55
  • 1
    I'd also point out that the `-o` option is a GNU extension to `grep` (which may not be available on the Mac). In any case, I think this answer deserves an upvote as it's a good regex which would be useful for simple cases where the link text is all on one line. – Anthony Geoghegan Sep 12 '17 at 10:06
2

Grep and sed are the wrong tools for this task. They are both line-oriented utilities which process files or standard input line by line. However, the file you want to process has line breaks within the link text so these utilities won’t work.

In general, parsing HTML with regex is a bad idea. It would be better to use a dedicated HTML/XML parser (there should be a library available in whatever language you’re familiar with). For tasks such as this, I find it easier to create a Python script (certainly easier than shell programming) and use its Beautiful Soup library.
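
As a rough illustration of the parser approach, here is a minimal sketch. It uses html.parser from Python's standard library rather than Beautiful Soup, so it runs without installing anything. The class name LinkExtractor, the function extract_links, and the sample HTML are assumptions made for this example, not something from the question.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace, including line breaks inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Hypothetical sample with a multi-line link and a link wrapping an image.
    sample = """
    <p><a href="page1.html">My Page 1</a>
    <a href="page2.html">My
       Page 2</a>
    <a href="page3.html"><img src="1.jpg" /> My Page 3</a></p>
    """
    for href, text in extract_links(sample):
        print(text)
        print(href)
```

To try it on the real file, read it with open("t2.html", encoding="utf-8").read() and pass the result to extract_links. The encoding is a guess and may need adjusting for the actual page.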

Anthony Geoghegan