Find Favicons in HTML using Perl

Question

I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade -- a decade in which now important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.

I thought I found the solution in WWW::Mechanize, since it can list all of the links in a given URL, including <link> header tags. However, I cannot seem to find a clean way to use the "find_link" method to search for the rel attribute.

For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."

my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;

I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.

The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:

my $results = $mech->find_all_links();

foreach my $result (@{ $results }) {
    my $attrs = $result->attrs();

    # Skip links that have no rel attribute at all (most <a> tags).
    next unless defined $attrs->{'rel'};

    # Test rel once per link rather than once per attribute.
    if ($attrs->{'rel'} =~ /^apple-touch-icon/) {
        say STDERR "I found it: " . $result->url();
    }

    # Add tests for other types of icons here.
    # E.g. "mask-icon" and "shortcut icon."
}

That works, but it seems messy. Is there a better way?

You make a valid point about Mechanize not being able to do this. I've created [a PR](https://github.com/libwww-perl/WWW-Mechanize/pull/305) to add a filter by rel. — simbabque, Sep 16 '20 at 08:53
Thanks @simbabque! It seems like Mechanize would be perfect for something like this if they added that option. — Timothy R. Butler, Sep 17 '20 at 00:19
brian's solution is great. I'd go with that. But I think we'll be adding it to Mechanize anyway (disclaimer: I'm one of the maintaining contributors). — simbabque, Sep 17 '20 at 07:39
Thanks @simbabque for your work on Mechanize -- I'm glad to have run into it. I'm using it with my initial hacked approach at least until I figure out how to do what I'm discussing with Brian in the comments on his answer. If it had what you put in the PR, it really would work perfectly. — Timothy R. Butler, Sep 17 '20 at 16:25
You're very welcome. It's already been approved, and apart from a typo I just fixed it should be good. I would assume there's going to be a release in the evening today in Canadian time. — simbabque, Sep 18 '20 at 11:16
I'll give it a try, thanks! Perhaps I should work it into an answer and share it on here. — Timothy R. Butler, Sep 26 '20 at 16:32

brian d foy · Accepted Answer · 2020-09-16T14:46:24.783

Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:

link[rel*=icon i][href]

This CSS selector looks for link tags that have both the rel and href attributes. Additionally, I require that the value in rel contain (*=) "icon", case insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].
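To see what the selector picks up without touching the network, it can be run against an inline document. This is only a minimal sketch: the HTML snippet and its paths are made up for illustration.

```perl
use v5.10;
use strict;
use warnings;

use Mojo::DOM;

# Made-up HTML for illustration only.
my $html = <<'HTML';
<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="Shortcut Icon" href="/favicon.ico">
<link rel="apple-touch-icon" href="/touch.png">
<link rel="icon">
</head><body><a rel="icon" href="/not-a-link-tag">x</a></body></html>
HTML

# Only <link> tags with an href whose rel contains "icon" (any case) match.
my $hrefs = Mojo::DOM->new($html)
    ->find('link[rel*=icon i][href]')
    ->map( attr => 'href' )
    ->to_array;

say for @$hrefs;
```

The stylesheet link, the link without an href, and the a tag are all skipped, which leaves /favicon.ico and /touch.png.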

Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):

use v5.10;

use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);

my $results = $ua->get( shift )
    ->result
    ->dom
    ->find( 'link[rel*=icon i][href]' )
    ->map( attr => 'href' )
    ->to_array
    ;

say join "\n", @$results;

That works pretty well so far:

$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico

$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2

$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png

Now, the problem comes if you find more interesting cases that you can't easily write a selector for. Suppose not all of the rel values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas so you don't have to use the experimental case insensitivity flag:

link[rel*=icon][href], link[rel*=ICON][href]

or different values in rel:

link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]

Line up as many of those as you like.
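One thing to watch with the comma-separated form: each selector is compared as written, so a mixed-case value can slip through. The sketch below uses made-up HTML and assumes Mojo::DOM compares attribute values case-sensitively when the i flag is absent.

```perl
use v5.10;
use strict;
use warnings;

use Mojo::DOM;

# Made-up HTML for illustration only.
my $dom = Mojo::DOM->new(<<'HTML');
<link rel="icon" href="/lower.ico">
<link rel="ICON" href="/upper.ico">
<link rel="Shortcut Icon" href="/mixed.ico">
HTML

# Two selectors separated by a comma: a node matches if either one does.
my $selector = 'link[rel*=icon][href], link[rel*=ICON][href]';
my $matched  = $dom->find($selector)->map( attr => 'href' )->to_array;

say for @$matched;
```

Under that assumption, "Shortcut Icon" contains neither "icon" nor "ICON", so /mixed.ico is not returned. Covering it would take another selector or the i flag.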

But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:

my %Interesting = ...;
my $results = $ua->get( shift )
    ->result
    ->dom
    ->find( '...' )
    ->grep( sub { exists $Interesting{ $_->attr('rel') } } )
    ->map( attr => 'href' )
    ->to_array
    ;
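As a concrete illustration of the grep approach, here is a sketch that runs against inline HTML. The contents of %Interesting, the broad 'link[rel][href]' selector, and the HTML itself are assumptions chosen for the example, not values from the answer.

```perl
use v5.10;
use strict;
use warnings;

use Mojo::DOM;

# Hypothetical set of rel values to accept; adjust to taste.
my %Interesting = map { $_ => 1 } (
    'icon',
    'shortcut icon',
    'apple-touch-icon',
    'apple-touch-icon-precomposed',
    'mask-icon',
);

# Made-up HTML for illustration only.
my $dom = Mojo::DOM->new(<<'HTML');
<link rel="stylesheet" href="/style.css">
<link rel="Shortcut Icon" href="/favicon.ico">
<link rel="apple-touch-icon" href="/touch.png">
<link rel="canonical" href="https://example.com/">
HTML

# Select broadly, then keep only nodes whose lowercased rel is in the hash.
my $results = $dom
    ->find('link[rel][href]')
    ->grep( sub { exists $Interesting{ lc $_->attr('rel') } } )
    ->map( attr => 'href' )
    ->to_array;

say for @$results;
```

Lowercasing rel before the lookup keeps the hash keys simple. The selector already guarantees that rel is present, so the lookup never sees an undefined value.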

I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.

Thanks, @briandfoy! If I wanted to do the analysis in several pieces, is there a good way to do that? For example, I'd probably prefer "apple-touch-icon-precomposed" over "favicon-16x16" if both are present. I've not played with Mojolicious before -- the syntax looks a tad different. Could I ask for two attributes in map (e.g. what the rel is and the href)? — Timothy R. Butler, Sep 17 '20 at 00:22
You can do anything you like in the `map`. That's really a shortcut for `map( sub { .... } )` where the item is in `$_` inside the sub. — brian d foy, Sep 17 '20 at 13:50
I think I'm starting to catch on. So, the anonymous subroutine in `map` could look at the different parts of `$_` and then output them? What format does `to_array` expect from that subroutine? Could it `return { 'rel' => $_->attr('rel'), 'href' => $_->attr('href') }` if I wanted the array to be an array of hashes with that data? Sorry for the dumb question -- I've never played with Mojolicious before. — Timothy R. Butler, Sep 17 '20 at 16:28
If you have a different question, ask a new question :) Most of the stuff you see there are from Mojo::Collection, so that's what you should play with. And, I have many examples in Mojo Web Clients. — brian d foy, Sep 17 '20 at 23:24
@TimothyR.Butler just keep in mind that Mojo is also just Perl. It's all Perl code, you can do anything you like with it. :) — simbabque, Sep 18 '20 at 11:17
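The hash-per-link idea raised in these comments can be sketched as follows. The %Rank preference table and the HTML are hypothetical. The only Mojo::Collection behavior relied on is that `map` accepts a subroutine and `to_array` returns whatever that subroutine produced.

```perl
use v5.10;
use strict;
use warnings;

use Mojo::DOM;

# Hypothetical preference order: lower number wins.
my %Rank = (
    'apple-touch-icon-precomposed' => 0,
    'apple-touch-icon'             => 1,
    'icon'                         => 2,
    'shortcut icon'                => 3,
);

# Made-up HTML for illustration only.
my $dom = Mojo::DOM->new(<<'HTML');
<link rel="icon" href="/favicon-16x16.png">
<link rel="apple-touch-icon-precomposed" href="/touch-precomposed.png">
<link rel="shortcut icon" href="/favicon.ico">
HTML

# Each node becomes a hash reference holding both rel and href.
my $icons = $dom
    ->find('link[rel*=icon i][href]')
    ->map( sub { +{ rel => lc $_->attr('rel'), href => $_->attr('href') } } )
    ->to_array;

# Pick the best-ranked icon; unknown rel values sort last.
my ($best) = sort {
    ( $Rank{ $a->{rel} } // 99 ) <=> ( $Rank{ $b->{rel} } // 99 )
} @$icons;

say "$best->{rel} => $best->{href}";
```

The leading `+` in front of the braces is there so Perl parses them as an anonymous hash rather than a block.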

score 0 · Answer 2 · answered Sep 15 '20 at 19:05

The problem is easy to solve in three steps:

- load the web page with any module that can fetch one
- define $regex to cover all of the favicon variations you want to find
- look for <link rel="$regex" href="icon_address" ...>

Note: The script falls back to a default YouTube URL, embedded in the code, when no URL is given on the command line.

use strict;
use warnings;
use feature 'say';

use HTTP::Tiny;

my $url = shift || 'https://www.youtube.com/';

my $icons = get_favicon($url);

say for @{$icons};

sub get_favicon {
    my $url = shift;
    
    my @lookup = (
                    'shortcut icon',
                    'apple-touch-icon',
                    'image_src',
                    'icon',
                    'alternative icon'
                );
                
    my $re      = join('|',@lookup);
    my $html    = load_page($url);
    my @icons   = ($html =~ /<link rel="(?:$re)" href="(.*?)"/gmsi);
    
    return \@icons;
}

sub load_page {
    my $url = shift;
    
    my $response = HTTP::Tiny->new->get($url);
    my $html;

    if ($response->{success}) {
        $html = $response->{content};
    } else {
        say 'ERROR:  Could not extract webpage';
        say 'Status: ' . $response->{status};
        say 'Reason: ' . $response->{reason};
        exit;
    }

    return $html;
}

Run as script.pl

https://www.youtube.com/s/desktop/8259e7c9/img/favicon.ico
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_32.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_48.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_96.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_144.png
https://www.youtube.com/img/desktop/yt_1200.png

Run as script.pl "http://www.microsoft.com/"

https://c.s-microsoft.com/favicon.ico?v2

Run as script.pl "http://finance.yahoo.com/"

https://s.yimg.com/cv/apiv2/default/icons/favicon_y19_32x32_custom.svg

answered Sep 15 '20 at 19:05 by Polar Bear

"Parsing" HTML with a regex is a bit brittle. I like the OP's idea of using WWW::Mechanize better. Your code has more possible file names, though. Probably it's best to merge both approaches. – Robert Sep 15 '20 at 19:09
@Robert -- The howitzer is not the right weapon against fleas, but it can still be utilized. – Polar Bear Sep 15 '20 at 19:16
@Robert, PB's implication that using a proper parser would be more complex is nonsense --it's usually less complex than the regex alternative-- but they do have a point. The risk of the regex breaking or producing something incorrect is low. And it might be faster than using a proper HTML parser ....or would it? – ikegami Sep 15 '20 at 22:53
The biggest bottleneck would be the download rate, so using a proper parser wouldn't have any negative performance effects. The biggest savings would be from using a downloader that produces the chunks downloaded as they are downloaded in order to terminate the download as soon as the header has been fully received. (You could still use a proper parser if it's a pull parser. And this could very well be faster than using regex.) – ikegami Sep 15 '20 at 22:56
Thanks, everyone. @PolarBear’s approach does seem simpler than mine, although I do like WWW::Mechanize’s resilience. For example, what if a slightly poorly formed site has the rel and href attributes reversed? (Off hand, maybe that isn’t even technically invalid.) Perhaps an option would be to split out the name/value pairs of all the attributes using regex and then analyze them so that it isn’t dependent on a certain order? At that point, is it better to stick with WWW::Mechanize? – Timothy R. Butler Sep 16 '20 at 07:07
Using a parser like Mojo::DOM with Mojo::UA should work well. I had a quick look at the WWW::Favicon module. Found its github and it has a 5 year old PR that hasn't been merged, but the maintainer owns over 40 dists on CPAN with the latest release in 2018 and his github isn't inactive. I could message him and take over the module. – simbabque Sep 16 '20 at 09:27
This code requires the HTML to be of a particular form. That's a bad idea. – brian d foy Sep 16 '20 at 13:51
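The idea floated in the comments above, splitting each tag's attributes into name/value pairs so that their order does not matter, could look roughly like this with core Perl only. It is a sketch on made-up HTML and is still regex-based, so it remains brittle: unquoted attribute values or a ">" inside a value would defeat it, and a real parser avoids those cases.

```perl
use strict;
use warnings;
use feature 'say';

# Made-up HTML; note that href comes before rel in the second tag.
my $html = <<'HTML';
<link rel="shortcut icon" href="/favicon.ico">
<link href="/touch.png" sizes="180x180" rel="apple-touch-icon">
<link rel="stylesheet" href="/style.css">
HTML

my @icons;

# Grab each <link ...> tag, then split its quoted attributes into a hash.
while ( $html =~ /<link\b([^>]*)>/gi ) {
    my $attr_text = $1;
    my %attr;
    while ( $attr_text =~ /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g ) {
        $attr{ lc $1 } = defined $2 ? $2 : $3;
    }

    # Attribute order no longer matters; only presence does.
    next unless defined $attr{rel} && defined $attr{href};
    push @icons, $attr{href} if $attr{rel} =~ /\bicon\b/i;
}

say for @icons;
```

Both icon links are found even though the second one lists href first. The stylesheet link is ignored.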
