
I have a variable $content containing a paragraph of mixed text and HTML img tags and URLs.

I would like to conditionally inject a string into some of the image URLs.

For example, suppose $content contains

ABC <img src="http://url1.com/keep.jpg">
DEF <img src="http://random-url.com/replace.jpg">
GHI <img src="http://url2.com/keep.jpg">

I would like to edit $content and make it

ABC <img src="http://url1.com/keep.jpg"> 
DEF <img src="http://wrapper-url.com/random-url.com/replace.jpg"> 
GHI <img src="http://url2.com/keep.jpg">

I have a list of regex conditions for URLs to keep (the 'whitelist'). Any image URL that does not match the whitelist should be edited to add the wrapper-url prefix.

My idea was:

if image tags matched in $content {
  if match is in 'whitelist'
    do nothing
  else
    inject prefix replacement
}

I don't know how to do a conditional global regex replacement, since everything is in a single-line string variable.

I need to implement this in Perl.


Additional information:

My 'whitelist' is currently only 5 lines, basically containing keywords and domains.

Here's what I've been doing for matching the 'whitelist'.

eg.

if ($_ =~ /s3\.static\.cdn\.net/) {
    # whitelist to keep, subdomain match
}
elsif ($_ =~ /keyword-to-keep/) {
    # whitelist to keep, url keyword match
}
elsif ($_ =~ /cdn\.domain\.com/) {
    # whitelist to keep, subdomain match
}
elsif ($_ =~ /whitelist-domain\.net/) {
    # whitelist to keep, domain match
}
elsif ($_ =~ /i\.whitelist-domain\.com/) {
    # whitelist to keep, subdomain match
}
else {
    # matched, do something about it with injection
}


A not so elegant solution I can think of is to globally replace all img urls with the prefix injection.

Then do another global replacement to remove the prefix by matching against the 'whitelist'.

Is there a more efficient solution to my problem?

Thanks.

KDX
  • You really need a proper HTML parser for this. Please show a sample of your *list of regex conditions* – Borodin Apr 02 '16 at 15:23
  • Original question modified with some regex conditions I've been using to check against for the 'whitelist' to keep. – KDX Apr 02 '16 at 16:03

2 Answers


As others have mentioned, using regexes to parse HTML is strongly discouraged - see here (among many other places) for the reasons.

Since your example data is short and simple, you can ignore the advice as long as you keep the limitations in mind. Some of the things to consider are:

  1. What if your whitelist keyword matches part of the domain?
  2. or vice versa - what if a domain (.net) is part of the path?
  3. What happens if the scheme is something other than http(s)?
  4. What if the URL is not in double quotes? or any quotes at all?
  5. What if there is something that looks like a tag in the "pre-text"?
  6. Are entries on the whitelist case-sensitive? Domain names are not; paths are; so what to do?
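
Points 1 and 2 can be seen with a small sketch. The rule and the second URL below are made up for illustration, and `host_of` is a hypothetical helper that uses a deliberately simplified pattern (no userinfo, no IPv6), so treat it as a demonstration rather than a URL parser:

```perl
use strict;
use warnings;

# Hypothetical whitelist rule, written in extended mode and case-insensitive.
my $rule = qr/ whitelist-domain \. net /xi;

# Pull the host out of an absolute URL; returns undef if there is no scheme://host.
# Simplified on purpose: a real URL module handles many more cases.
sub host_of {
    my ($url) = @_;
    return $url =~ m{ \A [a-z][a-z0-9+.-]* :// ([^/:?\#]+) }xi ? lc $1 : undef;
}

for my $url (
    'http://whitelist-domain.net/keep.jpg',
    'http://random-url.com/whitelist-domain.net/replace.jpg',
) {
    # Compare a match against the whole URL with a match against the host only.
    my $whole = $url =~ $rule          ? 'match' : 'no match';
    my $host  = host_of($url) =~ $rule ? 'match' : 'no match';
    print "$url\n  whole URL: $whole, host only: $host\n";
}
```

The second URL matches the rule when the whole URL is tested, because the domain name appears in its path, but not when only the host is tested. Which behaviour you want depends on whether a whitelist entry is meant as a domain or as a keyword.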

A few principles I've used in the solution below are:

  • separate regex specification from regex use
  • always use extended-mode regexes, i.e. the '/x' option
  • pre-process the whitelist to make an array of RE "tests" to pass
  • unix filter style - read on STDIN, write on STDOUT, warn on STDERR
  • use a module for the detail of handling parts of the URL

Given those things to consider, this will basically do it:

use v5.12;
use URI::URL;

my $wrapper_host   =  "wrapper-url.com" ;
my $whitelist_file =  "whitelist.txt"   ;
URI::URL::strict 1;   # Will croak if cannot determine scheme

my $text_re    = qr/ ^ ( \s* [^<]+ \s* ) /x ;
my $quoted_str = qr/ " ( [^"]+ ) " /x ;
my $img_tag_re = qr/ < img \s+ src= $quoted_str >  /x ;

my @whitelist_rules ;
open(my $white, '<', $whitelist_file) or die "$whitelist_file: $!\n" ;
while (<$white>) {
    chomp;
    s/\./\\./g;  # escape every '.', not just the first one
    push @whitelist_rules, qr/$_/ ;
}
close $white ;

while (<>) {

    # Parse the line into text and url
    my $text;  my $url;
    if (/ $text_re  $img_tag_re /x) {
        $text = $1 ;
        $url = new URI::URL $2 ;  # may croak
    }
    else {
        warn "Can't make sense of line $., skipping..." ;
        next ;
    }

    # iterate over @whitelist_rules to see if this one is exempt
    my $on_whitelist = 0;
    for my $r (@whitelist_rules) {
        $on_whitelist++ if $url =~ /$r/i ;            # Note: '/i'
        # $on_whitelist++ if $url->netloc =~ /$r/i ;  # alternatively ...
        # $on_whitelist++ if $url->path   =~ /$r/i ;  # alternatively ...
    }

    # If it's not on the whitelist, wrap netloc
    if ( ! $on_whitelist )  {
        $url->path( $url->netloc . $url->path );
        $url->netloc( $wrapper_host );
    }

    # output the transformed line
    say $text . $url ;
}
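
The filter above works line by line. Since the question has everything in one `$content` string, the same whitelist test can probably be applied in a single pass with `s///ge`, where the replacement side is Perl code that decides per match. The following is a minimal sketch using core Perl only; `wrap_url` and `wrap_images` are names I made up, the whitelist patterns are the ones from the question, and the `<img ... src="...">` pattern assumes double-quoted attributes, so all the caveats listed earlier still apply:

```perl
use strict;
use warnings;

# Whitelist patterns taken from the question; a URL matching any of them is kept.
my @whitelist = (
    qr/ s3 \. static \. cdn \. net /xi,
    qr/ keyword-to-keep /xi,
    qr/ cdn \. domain \. com /xi,
    qr/ whitelist-domain \. net /xi,
    qr/ i \. whitelist-domain \. com /xi,
);
my $wrapper_host = 'wrapper-url.com';

# Return the URL unchanged if it is whitelisted; otherwise insert the wrapper
# host right after the scheme. URLs without http(s):// are left as they are.
sub wrap_url {
    my ($url) = @_;
    return $url if grep { $url =~ $_ } @whitelist;
    (my $wrapped = $url) =~ s{ \A ( https?:// ) }{$1$wrapper_host/}xi;
    return $wrapped;
}

# Rewrite the src value of every <img ... src="..."> in one pass.
# \K keeps everything matched before it, so only the URL itself is replaced.
sub wrap_images {
    my ($content) = @_;
    $content =~ s{ <img \b [^>]*? \b src \s* = \s* " \K ( [^"]+ ) }
                 { wrap_url($1) }gexi;
    return $content;
}

my $content = <<'END';
ABC <img src="http://i.whitelist-domain.com/keep.jpg">
DEF <img src="http://random-url.com/replace.jpg">
GHI <img src="http://cdn.domain.com/keep.jpg">
END

print wrap_images($content);
```

Only the `random-url.com` line should come out with the `wrapper-url.com/` prefix. This avoids the wrap-everything-then-unwrap approach from the question, because the decision is made once per match inside the replacement.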
Marty
  • Thank you for the detailed analysis of scenarios I didn't think of. I ended up using HTML::TokeParser::Simple for image URL extraction instead of regexes, matching against my whitelist, then saving the result back to the original $content variable. – KDX Apr 03 '16 at 12:45
  1. You can use HTML::TokeParser::Simple to locate an img tag and extract the URL from its src attribute.

  2. You can extract the host name from the URL with URI::URL.

  3. You can convert your whitelist into a set for easy and efficient host name lookups.

  4. You can use the s// operator to wrap host names that are not in the whitelist.


use strict;
use warnings; 
use 5.020;
use HTML::TokeParser::Simple;
use URI::URL;
use List::Util qw{ any };

my @white_list = qw(
    s3.static.cdn.net
    cdn.domain.com
    whitelist-domain.net
    i.whitelist-domain.com
);
#Create a set:
my %white_list = map {$_ => undef} @white_list;

my @accepted_keywords = qw(
    xxx.xxx
    cool
);
#Escape any special regex characters appearing in the keywords:
@accepted_keywords = map { quotemeta $_ } @accepted_keywords;

my $wrapper_host = "wrapper-url.com";

my $content = <<END_OF_CONTENT;
ABC <img src="http://i.whitelist-domain.com/keep.jpg">
DEF <img src="http://random-url.com/replace.jpg">
GHI <img src="http://cdn.domain.com/keep.jpg">
XYZ <img src="http://random-url.com/replace.jpg">
ZZZ <img src="http://xxx.xxx/keep.jpg">
ZZZ <img src="http://xxxXxxx/replace.jpg">
ZZZ <img src="http://waycool.com/keep.jpg">
END_OF_CONTENT

my $parser = HTML::TokeParser::Simple->new(\$content);

my ($src, $url, $host, $regex);
while (my $token = $parser->get_token() ) {

    if ($token->is_tag('img') ) {
        if ($src = $token->get_attr('src') ) {
            $url = URI::URL->new($src);
            $host = $url->host;

            next if exists($white_list{$host});
            next if any { $host =~ /$_/ } @accepted_keywords;

            # Insert the wrapper host after the scheme (http or https)
            $src =~ s{\A(https?://)}{$1$wrapper_host/}i;
            $token->set_attr(
                'src',
                $src,
            );

        }
    }
}
continue {
    print $token->as_is;
}

--output:--
ABC <img src="http://i.whitelist-domain.com/keep.jpg">
DEF <img src="http://wrapper-url.com/random-url.com/replace.jpg">
GHI <img src="http://cdn.domain.com/keep.jpg">
XYZ <img src="http://wrapper-url.com/random-url.com/replace.jpg">
ZZZ <img src="http://xxx.xxx/keep.jpg">
ZZZ <img src="http://wrapper-url.com/xxxXxxx/replace.jpg">
ZZZ <img src="http://waycool.com/keep.jpg">
7stud
  • Indeed, using HTML::TokeParser::Simple is a much cleaner solution to my problem. With minor modifications, this solution works perfectly for me. Thanks. – KDX Apr 03 '16 at 12:37