There is generally no reason to build an HTML parser by hand, from scratch, and there is usually trouble awaiting down that road: regexes for HTML are picky, sensitive to details, and brittle to even tiny input changes, while requirements tend to evolve. Why not use one of the several excellent HTML libraries?
An example with HTML::TreeBuilder, which also extracts the complete links (a need stated in a comment)
use warnings;
use strict;
use feature 'say';
use HTML::TreeBuilder;
my $links_string =
q(<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a href="/search/title/?vendornum=StaplesA17" >View More Stock</a> );
my $dom = HTML::TreeBuilder->new_from_content($links_string);
my @links_html;
foreach my $tag ( $dom->look_down(_tag => "a") ) {
    push @links_html, $tag->as_HTML;  # the whole link, as is

    my $href = $tag->attr("href");

    my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/;
    say "$name = $value";

    say $tag->as_trimmed_text;  # or: ->as_text, which keeps some spaces

    # Or:
    # say for $tag->content_list;  # all child nodes, and/or text
}
#say for @links_html;
I use a string with a newline between the links for your "many links in a large variable," perhaps with some spaces around as well. This doesn't affect the parsing done by the library.
A few comments
The workhorse here is the HTML::Element class, with its powerful and flexible look_down method. If the string indeed contains just links then you can probably use that class directly, but done as above a full HTML document would parse just as well
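For instance, look_down can filter on attributes as well as on the tag name, and criteria can even be code references; a small sketch using the $dom from above (the class name is taken from the example string):

my @product_links = $dom->look_down(
    _tag  => 'a',
    class => 'product-lister',  # only links with this class
);

# a code reference allows arbitrary tests on each element
my @stock_links = $dom->look_down(
    _tag => 'a',
    sub { ($_[0]->attr('href') // '') =~ /vendornum/ },
);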
Once I get the URL I use a very simple-minded regex to pull out a single name-value pair. Adjust it if there can be more pairs, or let me know. Above all, use URI if there's more to it (see the sketch after these comments)
The as_trimmed_text method returns the text parts of the element's children, which in this case is presumably just the link's text. The content_list method returns all child nodes and/or text (the same thing here)
Use URI::Escape if there are percent-encoded characters to convert, per RFC 3986
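A minimal sketch of both URI and URI::Escape, assuming an href like the ones in the example (the percent-encoded value is made up for illustration):

use URI;
use URI::Escape qw(uri_unescape);

my $href = '/search/title/?vendornum=StaplesA03';

# query_form returns all name-value pairs from the query string
my %query = URI->new($href)->query_form;
say "$_ = $query{$_}" for keys %query;

# decode percent-encoded characters (a made-up value)
say uri_unescape('Staples%20A03');  # --> Staples A03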
The program above prints
vendornum = StaplesA03
View Stock
vendornum = StaplesA17
View More Stock
Another option is Mojo::DOM, which is a part of the whole Mojolicious ecosystem
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
my $links_string = q( ... ); # as above
my $dom = Mojo::DOM->new($links_string);
my @links_html;
foreach my $node ( $dom->find('a')->each ) {
    push @links_html, $node->to_string;  # or $node, gets stringified to HTML

    my $href = $node->attr('href');

    my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/;
    say "$name = $value";

    say $node->text;
}
#say for @links_html;
I use the same approach as above, and this prints the same. But note that Mojolicious provides yet other convenient ways: calls are often chained using a range of its useful methods, and very fine navigation through HTML is easily done with CSS selectors.
While it is probably useful here to loop as above, as an example we can also do
my $v = $dom->find('a')->map( sub {
    my ($name, $value) = $_->attr('href') =~ /\?(.+?)=([^&]+)/;
    say "$name = $value";
    say $_->text;
});
which prints the same as above. See Mojo::Collection to play with this further.
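For example, with the collection's chaining the href values can be gathered without an explicit loop; a small sketch, using the same $dom:

# extract all href attributes and print them, one per line
say $dom->find('a')->map(attr => 'href')->join("\n");

# or keep them in a list for further processing
my @hrefs = $dom->find('a')->map(attr => 'href')->each;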
The parameters in the URL can be parsed using Mojo::URL if you know the parameter's name
my $value = Mojo::URL->new($href)->query->param('vendornum');
If the names aren't fixed then Mojo::Parameters is useful
my $param_names = Mojo::Parameters->new( Mojo::URL->new($href)->query )->names;
where $param_names is an arrayref with the names of all parameters in the query. Or use
my $pairs = Mojo::Parameters->new( Mojo::URL->new($href)->query )->pairs;

# Or
# my %pairs = @{ Mojo::Parameters->new( Mojo::URL->new($href)->query )->pairs };
which returns an arrayref with all name-value pairs listed in succession (and which can be assigned directly to a hash, for instance).
An HTML document can also be nicely parsed using XML::LibXML.
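A minimal sketch, assuming the same $links_string as above; the recover flag makes libxml tolerate HTML fragments that aren't well-formed:

use warnings;
use strict;
use feature 'say';
use XML::LibXML;

my $links_string = q( ... );  # as above

my $doc = XML::LibXML->load_html(
    string  => $links_string,
    recover => 1,
);

foreach my $node ( $doc->findnodes('//a') ) {
    say $node->getAttribute('href');  # the URL
    say $node->textContent;           # the link's text
}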