Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom
to do all the parsing. From that, use a CSS selector to find the interesting nodes:
link[rel*=icon i][href]
This CSS selector looks for link
tags that have the rel
and href
tags at the same time. Additionally, I require that the value in rel
contain (*=
) "icon", case insensitively (the i
). If you want to assume that all nodes will have the href
, just leave off [href]
.
Once I have the list of links, I extract just the value in href
and turn that list into an array reference (although I could do the rest with Mojo::Collection
methods):
use v5.10;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);
my $results = $ua->get( shift )
->result
->dom
->find( 'link[rel*=icon i][href]' )
->map( attr => 'href' )
->to_array
;
say join "\n", @$results;
That works pretty well so far:
$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico
$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2
$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png
Now, the problem comes if you find more interesting cases that you can't easily write a selector for. Suppose not all of the rel
values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas so you don't have to use the experimental case insensitivity flag:
link[rel*=icon][href], link[rel*=ICON][href]
or different values in rel
:
link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]
Line up as many of those as you like.
But, you could also filter your results without the selectors. Use Mojo::Collection's grep
to pick out the nodes that you want:
my %Interesting = ...;
my $results = $ua->get( shift )
->result
->dom
->find( '...' )
->grep( sub { exists $Interesting{ $_->attr('rel') } } )
->map( attr => 'href' )
->to_array
;
I have a lot more examples of Mojo::DOM
in Mojo Web Clients, and I think I'll go add this example now.