How to stop at the next specific character in regex

Question

I have many links in a large variable, and am using regex to extract links. The most ideal link would look like

<a href="/search/product/?vendornum=StaplesA03">View Stock</a>

And my regex works perfectly looking for two matches: The full Link and the vendornum.

/<a href="\/search\/\product/(.*?)\/.*?>(.*?)<\/a>/igm

But occasionally, the link will include other info such as a class, which has its own quotes

<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>

And the extra "s throw me off. I cannot figure out the first match, which would be the first two "s

<a href="([^"]+)".*[^>].*?>View Stock</a>

I know regex can be very challenging, and I am using RegEx101.com, a real life saver.

But I just can't seem to figure out how to match the first pattern, the full href link, while excluding any other attributes (such as class) with their own quotes before I reach the closing >

Any experts in regex that can guide me?

While you can use a regex as *part* of a parser, trying to write a parser out of a single regex is just complicating things for yourself. Don't re-invent this complicated wheel; use an existing HTML parser. — ikegami, Dec 09 '20 at 05:20
See also https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la — tripleee, Dec 09 '20 at 07:13
I cleaned up the formatting of links in the question. If I missed something and you don't like it now you can "roll back" to your (previous) version: click on "edited..." link above my username (below the question text, to the left of your name), scroll down that page, and you'll see the link "rollback." — zdim, Dec 09 '20 at 21:18

zdim · Accepted Answer · 2020-12-11T06:14:19.927

There is generally no reason to build an HTML parser by hand, from scratch, while there's usually trouble awaiting down the road; regex are picky, sensitive to details, and brittle to even tiny input changes, while requirements tend to evolve. Why not use one of a few great HTML libraries?

An example with HTML::TreeBuilder (also extracting links, need stated in a comment)

use warnings;
use strict;
use feature 'say';

use HTML::TreeBuilder;

my $links_string = 
q(<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a> 
  <a href="/search/title/?vendornum=StaplesA17" >View More Stock</a> );

my $dom = HTML::TreeBuilder->new_from_content($links_string);

my @links_html;
foreach my $tag ( $dom->look_down(_tag => "a") ) { 
    push @links_html, $tag->as_HTML;  # the whole link, as is
    my $href = $tag->attr("href"); 
    my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/;   #/
    say "$name = $value";

    say $tag->as_trimmed_text;     # or: ->as_text, keep some spaces
    # Or:
    # say for $tag->content_list;  # all children, and/or text
};
#say for @links_html;

I use a string with a newline between links for your "many links in a large variable", perhaps with some spaces around as well. This doesn't affect parsing done by the library.

A few comments

The workhorse here is HTML::Element class, with its powerful and flexible look_down method. If the string indeed has just links then you can probably use that class directly, but when done as above a full HTML document would parse just as well
Once I get the URL I use a very simple-minded regex to pull out a single name-value pair. Adjust if there can be more pairs, or let me know. Above all, use URI if there's more to it
The as_trimmed_text returns text parts of element's children, which in this case is presumably just the text of the link. The content_list returns all child nodes (same here)
Use URI::Escape if there are percent-encoded characters to convert, per RFC 3986
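As a minimal sketch of the "use URI" suggestion above: the query_form method of the URI module returns all name/value pairs of a query string, already percent-decoded. The helper name query_params and the extra page and q parameters are made up for illustration; only vendornum comes from the question.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: return the query parameters of an href as a hash ref.
# query_form decodes percent-escapes in both names and values.
# Note: a name that repeats in the query keeps only its last value here.
sub query_params {
    my ($href) = @_;
    my %params = URI->new($href)->query_form;
    return \%params;
}

# The page and q parameters are invented for this example
my $params = query_params('/search/title/?vendornum=StaplesA03&page=2&q=red%20pens');
print "$_ = $params->{$_}\n" for sort keys %$params;
```

This replaces the simple-minded regex once a link can carry more than one parameter, or when values may contain escaped characters.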

This prints

vendornum = StaplesA03
View Stock
vendornum = StaplesA17
View More Stock

Another option is Mojo::DOM, which is a part of a whole ecosystem

use warnings;
use strict;
use feature 'say';

use Mojo::DOM;

my $links_string = q( ... );  # as above

my $dom = Mojo::DOM->new($links_string);
 
my @links_html;
foreach my $node ( $dom->find('a')->each ) { 
    push @links_html, $node->to_string;  # or $node, gets stringified to HTML
    my $href = $node->attr('href');
    my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/;   #/
    say "$name = $value";

    say $node->text;
}
#say for @links_html;

I use the same approach as above, and this prints the same. But note that Mojolicious provides yet other convenient ways. Often, calls are chained using a range of its useful methods, and very fine navigation through HTML is easily done using CSS selectors.

While it is probably useful here to loop as above, as an example we can also do

my $v = $dom -> find('a') 
    -> map( 
        sub { 
            my ($name, $value) = $_->attr('href') =~ /\?(.+?)=([^&]+)/;  
            say "$name = $value"; 
            say $_->text;
        }
    );

which prints the same as above. See Mojo::Collection to better play with this.
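To illustrate the CSS selectors mentioned earlier, here is a small sketch that lets the selector do the filtering, so only links whose href contains "vendornum=" are returned. The helper name vendor_hrefs and the second, unrelated /about/ link are assumptions added for the example.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# The second link is invented, to show that the selector skips it
my $html = q(<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
  <a href="/about/" >About us</a>);

# Hypothetical helper: hrefs of all links whose href contains "vendornum="
sub vendor_hrefs {
    my ($markup) = @_;
    return Mojo::DOM->new($markup)
        ->find('a[href*="vendornum="]')   # attribute "contains" selector
        ->map(attr => 'href')             # call ->attr('href') on each element
        ->to_array;
}

say for @{ vendor_hrefs($html) };
```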

The parameters in the URL can be parsed using Mojo::URL if you really know the name

my $value = Mojo::URL->new($href) 
    -> query
    -> param('vendornum');

If these aren't fixed then Mojo::Parameters is useful

my $param_names = Mojo::Parameters
    -> new( Mojo::URL->new($href)->query ) 
    -> names;

where $param_names is an arrayref with names of all parameters in the query, or use

my $pairs = Mojo::Parameters->new( Mojo::URL->new($href)->query ) -> pairs;
# Or
# my %pairs = @{ Mojo::Parameters->new(Mojo::URL->new($href)->query) -> pairs };

which returns an arrayref with all name,value pairs listed in succession (which can be directly assigned to a hash, for instance).
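If a hash is what is wanted in the end, Mojo::Parameters also has a to_hash method. The following is only a sketch; the page parameter is invented for the example.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::URL;

# The page parameter is made up for illustration
my $href = '/search/title/?vendornum=StaplesA03&page=2';

# query returns a Mojo::Parameters object; to_hash turns it into a hash reference
# (a name that occurs more than once gets an array reference as its value)
my $params = Mojo::URL->new($href)->query->to_hash;

say "$_ = $params->{$_}" for sort keys %$params;
```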

An HTML document can be nicely parsed using XML::LibXML as well.
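A rough sketch of the XML::LibXML route, under the assumption that the input is the same two-link string used above. It uses the module's HTML parser (load_html) and an XPath query.

```perl
use strict;
use warnings;
use feature 'say';
use XML::LibXML;

my $html = q(<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
  <a href="/search/title/?vendornum=StaplesA17" >View More Stock</a>);

# recover => 2 asks libxml2 to tolerate imperfect HTML without printing warnings
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

my @found;   # pairs of [href, link text]
for my $link ( $doc->findnodes('//a[@href]') ) {
    my $href = $link->getAttribute('href');
    my $text = $link->textContent;
    push @found, [ $href, $text ];
    say "$href => $text";
}
```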

What I really enjoyed about this solution is that I still get to use the regex I finally figured out. Because that was hard. Thank you — LuisC329, Dec 10 '20 at 01:14
Of course, I didn't need the links gutted. For another part of the app, I do need the whole link. Looking to see if there's a property — LuisC329, Dec 10 '20 at 01:31
@LuisC329 OK, I'm glad you like it :). In this case regex is still fine for the query, but if things get more complicated you'd want libraries for that as well (like the liked ones). What do you mean by "_the whole link_" ? By all means browse the docs and you'll find a way! Or let me know... — zdim, Dec 10 '20 at 03:28
@LuisC329 (i meant "linked ones" for the libraries, not "liked" :) — zdim, Dec 10 '20 at 03:58
I meant that I also need the entire ..... line for another purpose. I'll find a way in the manuals. I am not dyslexic, but I have some weird reading problem, and it's worse when not reading off of paper — LuisC329, Dec 10 '20 at 22:08
OK, I see. Added to the code in the answer. With `HTML::TreeBuilder` that is `->as_HTML` method, which returns the node as HTML so in this case exactly the whole link. With `Mojo::DOM` when you use the `$node` object where text might be expected it gets ["stringified"](https://perldoc.perl.org/perlglossary#stringification), to its HTML. (I also find reading digital content slightly ... disorienting, if that's the right word -- without reading issues otherwise) — zdim, Dec 11 '20 at 00:29
@LuisC329 In the meanwhile, I've edited a little more and added a couple of useful links. — zdim, Dec 11 '20 at 06:15
@LuisC329 (I realize I didn't tag you in one message above. I think you should still get notified in this case but if not here it is...) — zdim, Dec 11 '20 at 06:16
No tag notification. But thanks to you I was able to do it all, except find the original full link from ....., but it's a good tool, thanks — LuisC329, Dec 11 '20 at 15:34
@LuisC329 Cool. As for the original link (`...`), maybe I didn't explain it right: I added that to both programs. As it processes the string with all links (or any HTML document with links in it in fact), it stores each link as a string in `@links_html`. It's the first line in the loop (for each program), that gets the whole link and adds it to the array. The line that prints the array after the loop is commented out, uncomment it to see the actual links. — zdim, Dec 11 '20 at 20:35

score 1 · Answer 2 · answered Dec 09 '20 at 06:40

If I read correctly, you'd like to extract the vendornum value from the URL, and the link text. Best to use an html parser.

If you want to live dangerously with code that can break you can use a regex to parse html:

my $html = '<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>';
if($html =~ /<a href="[^\?]*\?vendornum=([^"]*)[^>]*>([^<]*).*$/) {
    print "vendornum: $1, link text: $2\n";
} else {
    print "no match";
}

Output:

vendornum: StaplesA03, link text: View Stock

Explanation:

vendornum=([^"]*) - scan for vendornum=, and capture everything after that until just before "
[^>]*> - scan over remaining attributes, such as class="", up to closing angle bracket
([^<]*) - capture link text
.*$ - scan up to end of text
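The same "stop at the next specific character" idea, a negated character class, can also pull out the whole href from every link in a larger string. This is only a sketch: it assumes href is the first attribute and that the link text contains no nested tags, and the second link's vendornum value is made up.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<a href="/search/product/?vendornum=StaplesA03">View Stock</a>
<a href="/search/title/?vendornum=StaplesA17" class="product-lister" >View More Stock</a>
HTML

# [^"]* cannot cross the quote that closes href, and [^>]* cannot cross the
# end of the opening tag, so later attributes such as class="..." are skipped.
my @links;
while ( $html =~ /<a\s+href="([^"]*)"[^>]*>([^<]*)<\/a>/gi ) {
    push @links, { href => $1, text => $2 };
}

print "$_->{href} | $_->{text}\n" for @links;
```

Unlike `.*` or `.*?`, a negated class can never run past the character it excludes, which is why the extra quotes in class="..." no longer interfere.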

score 1 · Answer 3 · answered Dec 09 '20 at 07:45

First of all you should consider using HTML::TreeBuilder for things like this. Once you get the hang of it it can be easier than coming up with regexes. However for quick and dirty tasks, a regex is fine.


use Data::Dump;

$text =
'<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a x=y href="/search/product/?Vendornum=651687" foo=bar>View Stockings</A>';

$regex =
qr{<a\s[^>]*?href="(?<link>[^"]*?\?vendornum=(?<vendornum>\w+)[^"]*)"[^>]*?>(?<desc>(?:(?!</a>).)*)</a>}i;

# %+ holds the named captures; pp called in void context prints the dump
while ($text =~ m/$regex/g) { Data::Dump::pp(\%+); }

Returns

{
  # tied Tie::Hash::NamedCapture
  desc => "View Stock",
  link => "/search/title/?vendornum=StaplesA03",
  vendornum => "StaplesA03",
}
{
  # tied Tie::Hash::NamedCapture
  desc => "View Stockings",
  link => "/search/product/?Vendornum=651687",
  vendornum => 651687,
}
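For readers who find the one-line pattern hard to follow, here is what should be the same regex spread out with the /x modifier and commented. It is a reading aid under that assumption, not a different technique; the %match hash is introduced only for the demonstration.

```perl
use strict;
use warnings;

my $regex = qr{
    <a\s [^>]*?                  # opening tag, any attributes before href
    href="
    (?<link>
        [^"]*? \?vendornum=      # path and anything before the parameter
        (?<vendornum> \w+ )      # the parameter value
        [^"]*                    # rest of the URL, up to the closing quote
    )
    "
    [^>]*? >                     # remaining attributes, end of opening tag
    (?<desc> (?: (?!</a>) . )* ) # link text: any char that does not start </a>
    </a>
}xi;

my $text = '<a x=y href="/search/product/?Vendornum=651687" foo=bar>View Stockings</A>';

# Copy the named captures out of %+ right after a successful match
my %match;
%match = %+ if $text =~ $regex;
print "$_ => $match{$_}\n" for sort keys %match;
```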

HTH
