
This should be a simple regex but I can't seem to figure it out.

Can someone please provide a 1-liner to take any string of arbitrary HTML input and populate an array with all the Facebook URLs (matching http://www.facebook.com) that were in the HTML code?

I don't want to use any CPAN modules and would much prefer a simple regex 1-liner.

Thanks in advance for your help!

Russell C.
  • Take a look at this answer: http://stackoverflow.com/questions/30847/regex-to-validate-uris – supercheetah Dec 12 '10 at 23:13
  • **Arbitrary** HTML, eh? And it has to be “on one line”? I hope it doesn’t also have to fit in 80 columns! And no CPAN modules. Well, I **CAN** do it, but you don’t want me to, I’m sure. Do you want a correct answer, or one that only works now and then? What about URLs within comments or script segments? What about stuff hidden by entities? Can there be comments in the middle of the tags? – tchrist Feb 26 '11 at 01:30

4 Answers


Obligatory link explaining why you shouldn't parse HTML using a regular expression.

That being said, try this for a quick and dirty solution:

my $html = '<a href="http://www.facebook.com/">A link!</a>';
my @links = $html =~ /<a[^>]*\shref=['"](https?:\/\/www\.facebook\.com[^"']*)["']/gis;
Cameron
  • That was what I was looking for and I appreciate the explanation of why not to use regex. I wanted something quick & dirty and will go back and clean up later. Thanks. – Russell C. Dec 12 '10 at 23:54
  • I'm opposed to telling people how to do this on principle, but +1 anyhow for using negated character classes instead of `.*?` (or, worse, just `.*`). – Dave Sherohman Dec 13 '10 at 11:43

See HTML::LinkExtor. There is no point wasting your life energy (nor ours) trying to use regular expressions for these types of tasks.

You can read the documentation for a Perl module installed on your computer by using the perldoc utility. For example, perldoc HTML::LinkExtor. Usually, module documentation begins with an example of how to use the module.

Here is a slightly more modern adaptation of one of the examples in the documentation:

#!/usr/bin/env perl

use v5.20;
use warnings;

use feature 'signatures';
no warnings 'experimental::signatures';

use autouse Carp => qw( croak );

use HTML::LinkExtor qw();
use HTTP::Tiny qw();
use URI qw();

run( $ARGV[0] );

sub run ( $url ) {
    my @images;

    my $parser = HTML::LinkExtor->new(
        sub ( $tag, %attr ) {
            return unless $tag eq 'img';
            push @images, { %attr };
            return;
        }
    );

    my $response = HTTP::Tiny->new->get( $url, {
            data_callback => sub { $parser->parse($_[0]) }
        }
    );

    unless ( $response->{success} ) {
        croak sprintf('%d: %s', $response->{status}, $response->{reason});
    }

    my $base = $response->{url};

    for my $image ( @images ) {
        say URI->new_abs( $image->{src}, $base )->as_string;
    }
}

Output:

$ perl t.pl https://www.perl.com/
https://www.perl.com/images/site/perl-onion_20.png
https://www.perl.com/images/site/twitter_20.png
https://www.perl.com/images/site/rss_20.png
https://www.perl.com/images/site/github_light_20.png
https://www.perl.com/images/site/perl-camel.png
https://www.perl.com/images/site/perl-onion_20.png
https://www.perl.com/images/site/twitter_20.png
https://www.perl.com/images/site/rss_20.png
https://www.perl.com/images/site/github_light_20.png
https://i.creativecommons.org/l/by-nc/3.0/88x31.png
Sinan Ünür
  • if we decided to go the HTML::LinkExtor direction could you provide some sample code of how this might work. Thanks! – Russell C. Dec 12 '10 at 23:54
  • why bother trying to help the guy if all you are going to say is "see the documentation" – Literat Feb 25 '11 at 23:35
  • code was stale, so I fixed it up so it should run on debian 10... https://gist.github.com/kanliot/dbb81b40e257ca315d6903f852547e18 – marinara Sep 30 '20 at 07:37

Russell C., have you seen the beginning of the Facebook movie, where Mark Zuckerberg uses Perl to automatically extract all the photos from a college facebook (and then posts them online)? I was like "that's how I'd do it! I'd use Perl too!" (except it would probably take me a few days to work out, not 2 minutes). Anyway, I'd use the module WWW::Mechanize to extract links (or photos):

use strict;
use WWW::Mechanize;

open( OUT, ">out.txt" );
my $url  = "http://www.facebook.com";
my $mech = WWW::Mechanize->new();
$mech->get($url);
my @a = $mech->links;
print OUT "\n", $a[$_]->url for ( 0 .. $#a );

However, this won't log you in to your Facebook page; it will just take you to the log-in screen. I'd use HTTP::Cookies to log in. For that, see the documentation. Only joking, just ask. Oh god, the apple strudel is burning!
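If you do need a logged-in session, a rough sketch of the idea follows. The form field names `email` and `pass` (and the credentials, obviously) are assumptions about Facebook's login form, not something verified against the live site, and a plain scripted login may well be blocked:

```perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

# Persist cookies to disk so the session survives between runs.
my $mech = WWW::Mechanize->new(
    cookie_jar => HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 ),
);

$mech->get('http://www.facebook.com');

# 'email' and 'pass' are guesses at the login form's field names.
$mech->submit_form(
    form_number => 1,
    fields      => { email => 'you@example.com', pass => 'secret' },
);

# Once logged in, link extraction works exactly as above.
print $_->url, "\n" for $mech->links;
```

The cookie jar is the important part: after one successful `submit_form`, later runs can skip the login step entirely as long as the saved session is still valid.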

Literat

Maybe this can help you:

if ($input =~ /(http:\/\/www\.facebook\.com\/\S+)/) { push(@urls, $1); }
Pirooz
  • without commenting on the regex, why not slurp the whole html page in, then do something like `@urls = $html =~ /([regex])/gm` or maybe `/gs`, I always forget. Still, you get all the matches in one shot. – Joel Berger Feb 26 '11 at 04:57
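For what it's worth, the slurp-and-match idiom Joel describes looks roughly like this (the sample HTML is made up). In list context, `//g` returns every capture in one shot; `/s` is only needed if the pattern contains a `.` that must match across newlines, which this one doesn't:

```perl
use strict;
use warnings;

# Stand-in for a slurped HTML page.
my $html = <<'HTML';
<a href="http://www.facebook.com/somepage">One</a>
<a href="http://www.facebook.com/otherpage">Two</a>
HTML

# In list context, //g hands back every capture, so a single
# match expression populates the whole array.
my @urls = $html =~ m{(http://www\.facebook\.com/\S+?)["']}g;

print "$_\n" for @urls;
```

Compare the `if`/`push` version above, which only ever captures the first match per string it is run against.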