The simplest way to do this is using the following regular expression.
/href="([^"]+)"/
The character class [^"]+ matches every character after the opening quote up to, but not including, the next quote. In most languages this is typically the fastest way to capture a quoted string that cannot itself contain quotes. That holds for attribute values, because a literal quote inside one should be encoded (for example as &quot;).
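As a minimal sketch of how the capture group behaves, the snippet below applies the expression to a single string. The sample markup and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, made up for illustration.
my $html = '<a href="https://example.com/page?id=1">link</a>';

# [^"]+ consumes everything up to (but not including) the closing quote;
# in list context the match returns the contents of the capture group.
my ($url) = $html =~ /href="([^"]+)"/;

print "$url\n" if defined $url;
```

In list context a successful match returns its captures, so $url holds only the attribute value without the surrounding quotes.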
UPDATE: A complete Perl program for parsing URLs would look like this:
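use 5.010;
use strict;
use warnings;
while (<>) {
    my @matches;    # reset for each line so earlier URLs are not printed again
    push @matches, m/href="([^"]+)"/gi;              # double-quoted values
    push @matches, m/href='([^']+)'/gi;              # single-quoted values
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;   # unquoted values
    say for @matches;
}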
It reads from stdin and prints every URL it finds, one per line. It handles the three quoting styles for attribute values: double-quoted, single-quoted, and unquoted. Use it with curl
to find all the URLs in a webpage:
curl url | perl urls.pl