
This should be a simple regex but I can't seem to figure it out.

Can someone please provide a 1-liner to take any string of arbitrary HTML input and populate an array with all the Facebook URLs (matching http://www.facebook.com) that were in the HTML code?

I don't want to use any CPAN modules and would much prefer a simple regex 1-liner.

Thanks in advance for your help!

Russell C.
  • Take a look at this answer: http://stackoverflow.com/questions/30847/regex-to-validate-uris – supercheetah Dec 12 '10 at 23:13
  • **Arbitrary** HTML, eh? And it has to be “on one line”? I hope it doesn’t also have to fit in 80 columns! And no CPAN modules. Well, I **CAN** do it, but you don’t want me to, I’m sure. Do you want a correct answer, or one that only works now and then? What about URLs within comments or script segments? What about stuff hidden by entities? Can there be comments in the middle of the tags? – tchrist Feb 26 '11 at 01:30

4 Answers


Obligatory link explaining why you shouldn't parse HTML using a regular expression.

That being said, try this for a quick and dirty solution:

my $html = '<a href="http://www.facebook.com/">A link!</a>';
my @links = $html =~ /<a[^>]*\shref=['"](https?:\/\/www\.facebook\.com[^"']*)["']/gis;
Cameron
  • That was what I was looking for and I appreciate the explanation of why not to use regex. I wanted something quick & dirty and will go back and clean up later. Thanks. – Russell C. Dec 12 '10 at 23:54
  • I'm opposed to telling people how to do this on principle, but +1 anyhow for using negated character classes instead of `.*?` (or, worse, just `.*`). – Dave Sherohman Dec 13 '10 at 11:43

See HTML::LinkExtor. There is no point wasting your life energy (nor ours) trying to use regular expressions for these types of tasks.

You can read the documentation for a Perl module installed on your computer by using the perldoc utility. For example, perldoc HTML::LinkExtor. Usually, module documentation begins with an example of how to use the module.

Here is a slightly more modern adaptation of one of the examples in the documentation:

#!/usr/bin/env perl

use v5.20;
use warnings;

use feature 'signatures';
no warnings 'experimental::signatures';

use autouse Carp => qw( croak );

use HTML::LinkExtor qw();
use HTTP::Tiny qw();
use URI qw();

run( $ARGV[0] );

sub run ( $url ) {
    my @images;

    my $parser = HTML::LinkExtor->new(
        sub ( $tag, %attr ) {
            return unless $tag eq 'img';
            push @images, { %attr };
            return;
        }
    );

    my $response = HTTP::Tiny->new->get( $url, {
            data_callback => sub { $parser->parse($_[0]) }
        }
    );

    unless ( $response->{success} ) {
        croak sprintf('%d: %s', $response->{status}, $response->{reason});
    }

    my $base = $response->{url};

    for my $image ( @images ) {
        say URI->new_abs( $image->{src}, $base )->as_string;
    }
}

Output:

$ perl t.pl https://www.perl.com/
https://www.perl.com/images/site/perl-onion_20.png
https://www.perl.com/images/site/twitter_20.png
https://www.perl.com/images/site/rss_20.png
https://www.perl.com/images/site/github_light_20.png
https://www.perl.com/images/site/perl-camel.png
https://www.perl.com/images/site/perl-onion_20.png
https://www.perl.com/images/site/twitter_20.png
https://www.perl.com/images/site/rss_20.png
https://www.perl.com/images/site/github_light_20.png
https://i.creativecommons.org/l/by-nc/3.0/88x31.png
Sinan Ünür
  • if we decided to go the HTML::LinkExtor direction could you provide some sample code of how this might work. Thanks! – Russell C. Dec 12 '10 at 23:54
  • why bother trying to help the guy if all you are going to say is "see the documentation" – Literat Feb 25 '11 at 23:35
  • code was stale, so I fixed it up so it should run on debian 10... https://gist.github.com/kanliot/dbb81b40e257ca315d6903f852547e18 – marinara Sep 30 '20 at 07:37

Russell C., have you seen the beginning of the Facebook movie, where Mark Zuckerberg uses Perl to automatically extract all the photos from a college facebook (and then posts them online)? I was like "that's how I'd do it! I'd use Perl too!" (except it would probably take me a few days to work out, not 2 minutes). Anyway, I'd use the module WWW::Mechanize to extract links (or photos):

use strict;
use WWW::Mechanize;

open( OUT, ">out.txt" );
my $url  = "http://www.facebook.com";
my $mech = WWW::Mechanize->new();
$mech->get($url);
my @a = $mech->links;
print OUT "\n", $a[$_]->url for ( 0 .. $#a );

However, this won't log you in to your Facebook page; it will just take you to the log-in screen. I'd use HTTP::Cookies to log in. For that, see the documentation. Only joking, just ask. Oh god, the apple strudel is burning!
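If you do need a logged-in session, a rough sketch of the idea follows. The form field names `email` and `pass` (and the credentials, obviously) are assumptions about Facebook's login form, not something verified against the live site, and a plain scripted login may well be blocked:

```perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

# Persist cookies to disk so the session survives between runs.
my $mech = WWW::Mechanize->new(
    cookie_jar => HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 ),
);

$mech->get('http://www.facebook.com');

# 'email' and 'pass' are guesses at the login form's field names.
$mech->submit_form(
    form_number => 1,
    fields      => { email => 'you@example.com', pass => 'secret' },
);

# Once logged in, link extraction works exactly as above.
print $_->url, "\n" for $mech->links;
```

The cookie jar is the important part: after one successful `submit_form`, later runs can skip the login step entirely as long as the saved session is still valid.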

Literat

Maybe this can help you:

if ($input =~ /(http:\/\/www\.facebook\.com\/\S+)/) { push(@urls, $1); }
Pirooz
  • without commenting on the regex, why not slurp the whole html page in, then do something like `@urls = $html =~ /([regex])/gm` or maybe `/gs`, I always forget. Still, you get all the matches in one shot. – Joel Berger Feb 26 '11 at 04:57
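For what it's worth, the slurp-and-match idiom Joel describes looks roughly like this (the sample HTML is made up). In list context, `//g` returns every capture in one shot; `/s` is only needed if the pattern contains a `.` that must match across newlines, which this one doesn't:

```perl
use strict;
use warnings;

# Stand-in for a slurped HTML page.
my $html = <<'HTML';
<a href="http://www.facebook.com/somepage">One</a>
<a href="http://www.facebook.com/otherpage">Two</a>
HTML

# In list context, //g hands back every capture, so a single
# match expression populates the whole array.
my @urls = $html =~ m{(http://www\.facebook\.com/\S+?)["']}g;

print "$_\n" for @urls;
```

Compare the `if`/`push` version above, which only ever captures the first match per string it is run against.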