If you have a -P option in your grep so that it accepts PCRE patterns, you should be able to use better regexes. Sometimes a minimal quantifier like *? helps. Also, you’re getting the whole input line, not just the match itself; if you have a -o option to grep, it will list only the part that matches.
grep -Po '<a[^<>]*>'
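To see what -o changes, here is a minimal sketch on a made-up input line (the sample text and the variable name line are assumptions for illustration). This particular pattern uses no PCRE features, so the sketch leaves out -P to stay portable.

```shell
# Hypothetical sample line; the file name and link text are made up.
line='See <a href="x.html">the docs</a> for details.'

# Without -o, grep prints the whole matching line.
printf '%s\n' "$line" | grep '<a[^<>]*>'

# With -o, grep prints only the text that matched the pattern.
printf '%s\n' "$line" | grep -o '<a[^<>]*>'
```

The first command echoes the full sentence back; the second should print just the opening tag.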
If your grep doesn’t have those options, try
perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'
Because -00 makes perl read a paragraph at a time instead of a line at a time, a match can now cross line boundaries.
To do a real parse of HTML requires regexes substantially more complex than you are apt to wish to enter on the command line. Here’s one example, and here’s another. Those may not convince you to try a non-regex approach, but they should at least show you how much harder it is in the general case than in specific ones.
This answer shows why all things are possible, but not all are expedient.