
On my forum, I want to automatically add rel="nofollow" to links that point to external sites. For instance, someone creates a post with the following text:

Link 1: <a href="http://www.external1.com" target="_blank">External Link 1</A>
Link 2: <a href="http://www.myforum.com">Local Link 1</A>
Link 3: <a href="http://www.external2.com">External Link 2</A>
Link 4: <a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>

Using Perl, I want that changed to:

Link 1: <a href="http://www.external1.com" target="_blank" rel="nofollow">External Link 1</A>
Link 2: <a href="http://www.myforum.com">Local Link 1</A>
Link 3: <a href="http://www.external2.com" rel="nofollow">External Link 2</A>
Link 4: <a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>

I can do this using quite a few lines of code, but I was hoping I could do this with one or more regexes. But I can't figure out how.

  • Do you allow arbitrary HTML in the posts? Or do you use some other markup language like BB code – in that case, augmenting the parser for that might be better. – amon Sep 03 '13 at 22:35
  • I'm not familiar with Perl, really, but would something like: "(.*?href=/"http:////www/.myforum/.com/".*?)(.*?>.*)" be your regex, and then if that matches, it's local, otherwise it's not. And then if it's not you find "(.*?href=.*?)(>.*)" and replace what you found with the first element + " rel=/"nofollow/"" + the second element? – CamelopardalisRex Sep 03 '13 at 22:47
  • @amon: Yes, the forum uses a markup language but posts are stored in their HTML form. – Clark Ventura Sep 04 '13 at 07:59
  • @AlexBaldwin: That would only work if there's only one link in the string. – Clark Ventura Sep 04 '13 at 08:03
  • @ClarkVentura, you're right. I was trying to make it work with the example given, and didn't pay attention to the real world problem he mentioned. However, if coupled with a regex that simply finds all links, you could then use my old regex on each link found. – CamelopardalisRex Sep 04 '13 at 20:40

2 Answers

Regexes can work in limited scenarios, but you should never use regexes to parse HTML:

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.

    — from RegEx match open tags except XHTML self-contained tags

I am quite fond of the Mojo suite, because it allows us to use a proper parser with very little code. We can then use CSS selectors to find interesting elements:

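use strict;
use warnings;
use autodie;
use Mojo::DOM;    # load the DOM parser explicitly
use File::Slurp;

for my $filename (@ARGV) {
  # Parse the whole file into a DOM tree.
  my $dom = Mojo::DOM->new(scalar read_file $filename);

  # Mark every link whose href does not point at the forum itself.
  for my $link ($dom->find('a[href]')->each) {
    $link->attr(rel => 'nofollow')
      if $link->attr('href') !~ m(\Ahttps?://www[.]myforum[.]com(?:/|\z));
  }

  # Write to a temp file first, then rename it over the original.
  write_file "$filename~", "$dom";
  rename "$filename~" => $filename;
}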

Invocation: perl mark-links-as-nofollow.pl *.html

A test run on your data produces the output:

Link 1: <a href="http://www.external1.com" rel="nofollow" target="_blank">External Link 1</a>
Link 2: <a href="http://www.myforum.com">Local Link 1</a>
Link 3: <a href="http://www.external2.com" rel="nofollow">External Link 2</a>
Link 4: <a alt="Local" href="http://www.myforum.com/test">Local Link 2</a>

Why did I use tempfiles and rename? On most file systems, a file can be renamed atomically, whereas writing to a file takes some time. If I wrote to the original file directly, other processes might see a half-written file.
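
The write-then-rename idea can be pulled out into a small helper. The following is only a sketch using core modules; the name write_atomically is made up for illustration, and it assumes the temp file and the target live on the same file system (which is why the temp file is created in the target's directory).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: replace the contents of $path with $content
# so that readers never observe a partially written file.
sub write_atomically {
    my ($path, $content) = @_;

    # Create the temp file next to the target so that rename()
    # does not have to cross a file system boundary.
    my ($fh, $tmp) = tempfile('.tmpXXXXXX', DIR => dirname($path));
    print {$fh} $content or die "Cannot write $tmp: $!";
    close $fh            or die "Cannot close $tmp: $!";

    # Readers see either the old file or the complete new one.
    rename $tmp, $path or die "Cannot rename $tmp to $path: $!";
    return;
}
```

In the loop above this would stand in for the write_file/rename pair, e.g. write_atomically($filename, "$dom"). Note that File::Temp creates its files with restrictive permissions, so the mode may need to be adjusted before the rename if other users have to read the result.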

amon

I'd use a regex with the global (/g) and eval (/e) flags, so that the replacement acts as a callback, e.g. like so:

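#!/usr/bin/perl

use strict;
use warnings;

# An internal link is one whose host is exactly myforum.com (optionally with
# www.); the trailing class rejects hosts like myforum.com.example.org.
my $internal_link = qr'href="https?:\/\/(?:www\.)?myforum\.com[\/":?#]';

my $html = '
Lorem ipsum
<a href="http://www.external1.com" target="_blank">External Link 1</A>
Lorem ipsum
<a href="http://www.myforum.com">Local Link 1</A>
Lorem ipsum
<a href="http://www.external2.com">External Link 2</A>
Lorem ipsum
<a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>
';

# /e evaluates the replacement as Perl code, /g applies it to every <a ...> tag.
$html =~ s/<a ([^>]+)>/"<a ". replace_externals($1). ">"/eg;

print $html;

# Return the tag's attributes unchanged for internal links,
# otherwise with rel="nofollow" appended.
sub replace_externals {
    my ($inner) = @_;
    return $inner =~ $internal_link ? $inner : "$inner rel=\"nofollow\"";
}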

Alternatively you can surely use negative look-aheads, but that would just mess up the readability.
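
For comparison, a negative look-ahead version might look like the sketch below. The helper name add_nofollow is made up for illustration and myforum.com is the example domain from the question. The important detail is that the look-ahead only scans [^>]*, so it cannot run past the end of the current tag and pick up the href of a later link.

```perl
use strict;
use warnings;

# Hypothetical helper: add rel="nofollow" to every <a> tag whose href
# does not point at myforum.com (the example domain from the question).
sub add_nofollow {
    my ($html) = @_;
    $html =~ s{
        <a\s+                      # start of an anchor tag
        (?!                        # give up on this tag if, before its closing >,
            [^>]*\bhref="https?://(?:www\.)?myforum\.com[/":?\#]
        )                          # ... an internal href shows up
        ([^>]*\bhref=[^>]*)        # the attributes, which must include an href
        >
    }{<a $1 rel="nofollow">}gix;
    return $html;
}
```

Calling add_nofollow($html) on the sample string above should leave the two myforum.com links untouched and append rel="nofollow" to the other two. Like the callback version, this assumes double-quoted href values and would add a second rel attribute to a tag that already has one.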

ukautz
  • Ooooooo! I had never worked with evals in regexes before but that looks like it should work beautifully! Thanks! – Clark Ventura Sep 04 '13 at 08:04
  • BTW I had already tried to do it with negative look-aheads, but that didn't work (since it would simply skip over the non-local links until it found a local link and mash the two links together - hard to explain but if you try it, you'll see what I mean). – Clark Ventura Sep 04 '13 at 08:08