13

The function below is designed to apply rel="nofollow" attributes to all external links and to no internal links, unless the link's path starts with a predefined root URL, defined as $my_folder below.

So given the variables...

$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';

And the content...

<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com">external</a>

The end result after replacement should be...

<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator" rel="nofollow">internal cloaked link</a>

<a href="http://cnn.com" rel="nofollow">external</a>

Notice that the first link is not altered, since it's an internal link.

The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.

The third link is the easiest: since it does not match $blog_url, it's obviously an external link.

However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?

function save_rseo_nofollow($content) {
$my_folder =  $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
    preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
    for ( $i = 0; $i <= sizeof($matches[0]); $i++){
        if ( !preg_match( '~nofollow~is',$matches[0][$i])
            && (preg_match('~' . $my_folder . '~', $matches[0][$i]) 
               || !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
            $result = trim($matches[0][$i],">");
            $result .= ' rel="nofollow">';
            $content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
        }
    }
    return $content;
}
alex
Scott B
  • I think DOMDocument would be nicer to use with this. – alex Feb 18 '11 at 04:28
  • @alex: don't get me started, LOL. I'm sure it would be, but every time I've tried it, I've ended up with 4 times more code and it never works exactly right. At least I can get the preg_match to work, but it needs some minor tweaks. But I'm not against giving DOMDocument another shot if someone can crack the question with a DOMDocument example that works on the WordPress content editor's post_content object. – Scott B Feb 18 '11 at 04:39
  • Try phpQuery rather than cumbersome DOMDocument. But at this point it should also not go unmentioned that deploying `rel=nofollow` is quite pointless. It does not help with your or anyone else's spam problem. It's just free labour so Google has less work. It's not known to be a deterrent for spambots either. – mario Feb 18 '11 at 04:43
  • @Scott B I posted a DOMDocument solution that works. :) – alex Feb 18 '11 at 05:27
  • @mario I agree that DOMDocument is cumbersome. I might check out this phpQuery sometime soon, thanks for the suggestion :) – alex Feb 18 '11 at 05:50
  • preg_replace_callback does just fine. No need for crazy objects or lots of strpos calls. – Jimmy Ruska Feb 18 '11 at 06:28

9 Answers

15

Here is the DOMDocument solution...

$str = '<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com" rel="me">external</a>

<a href="http://google.com">external</a>

<a href="http://example.com" rel="nofollow">external</a>

<a href="http://stackoverflow.com" rel="junk in the rel">external</a>
';
$dom = new DOMDocument();

$dom->preserveWhiteSpace = FALSE;

$dom->loadHTML($str);

$a = $dom->getElementsByTagName('a');

$host = strtok($_SERVER['HTTP_HOST'], ':');

foreach($a as $anchor) {
        $href = $anchor->attributes->getNamedItem('href')->nodeValue;

        if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
           continue;
        }

        $noFollowRel = 'nofollow';
        $oldRelAtt = $anchor->attributes->getNamedItem('rel');

        if ($oldRelAtt == NULL) {
            $newRel = $noFollowRel;
        } else {
            $oldRel = $oldRelAtt->nodeValue;
            $oldRel = explode(' ', $oldRel);
            if (in_array($noFollowRel, $oldRel)) {
                continue;
            }
            $oldRel[] = $noFollowRel;
            // separator first: the reversed argument order is not accepted by PHP 8
            $newRel = implode(' ', $oldRel);
        }

        $newRelAtt = $dom->createAttribute('rel');
        $noFollowNode = $dom->createTextNode($newRel);
        $newRelAtt->appendChild($noFollowNode);
        $anchor->appendChild($newRelAtt);

}

var_dump($dom->saveHTML());

Output

string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com" rel="me nofollow">external</a>

<a href="http://google.com" rel="nofollow">external</a>

<a href="http://example.com" rel="nofollow">external</a>

<a href="http://stackoverflow.com" rel="junk in the rel nofollow">external</a>
</body></html>
"
alex
11

Try to make it more readable first, and only afterwards make your if rules more complex:

function save_rseo_nofollow($content) {
    $content["post_content"] =
    preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
    return $content;
}

function cb2($match) { 
    list($original, $tag) = $match;   // regex match groups

    $my_folder =  "/hostgator";       // re-add quirky config here
    $blog_url = "http://localhost/";

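    // strpos() returns false when nothing is found, so compare against false explicitly
    if (strpos($tag, "nofollow") !== false) {
        return $original;
    }
    elseif (strpos($tag, $blog_url) !== false && (!$my_folder || strpos($tag, $my_folder) === false)) {
        return $original;
    }
    else {
        return "<$tag rel='nofollow'>";
    }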
}

This gives the following output:

[post_content] =>
  <a href="http://localhost/mytest/">internal</a>
  <a href="http://localhost/mytest/go/hostgator" rel='nofollow'>internal cloaked link</a>
  <a href="http://cnn.com" rel='nofollow'>external</a>

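The problem in your original code is most likely $rseo, which isn't declared anywhere inside the function. $my_folder then ends up empty, the pattern '~~' matches every tag, and so every link gets nofollow.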
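If the settings live outside the function, one way to get them into the callback is a closure with `use` (PHP 5.3+). This is only a minimal sketch: the wrapper name add_nofollow_regex() is hypothetical, $blog_url and $my_folder are assumed to be passed in by the caller, and it does not merge with an existing rel attribute.

```php
<?php
// Sketch: same callback approach, but the settings are handed to the
// callback with `use` instead of being read from an undeclared variable.
// add_nofollow_regex() is a hypothetical name used for illustration.
function add_nofollow_regex($html, $blog_url, $my_folder = '')
{
    return preg_replace_callback(
        '~<(a\s[^>]+)>~i',
        function ($match) use ($blog_url, $my_folder) {
            list($original, $tag) = $match;

            // Leave tags alone that already carry nofollow.
            if (stripos($tag, 'nofollow') !== false) {
                return $original;
            }

            $is_internal = strpos($tag, $blog_url) !== false;
            // An empty $my_folder means "no cloaked folder configured".
            $is_cloaked = $my_folder !== '' && strpos($tag, $my_folder) !== false;

            if ($is_internal && !$is_cloaked) {
                return $original;
            }

            // Note: a tag that already has another rel value gets a second rel attribute here.
            return "<$tag rel=\"nofollow\">";
        },
        $html
    );
}
```

In the question's setup this could be called on $content["post_content"] from inside the wp_insert_post_data callback, with both settings read there first.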

mario
  • @Mario: Got it. Thanks. How would you encapsulate the rel attribs in quotes? rel="nofollow" to validate? – Scott B Feb 18 '11 at 06:08
  • @Scott: Use `return "<$tag rel=\"nofollow\">";` or inner single quotes instead. – mario Feb 18 '11 at 06:10
  • @mario: what does the & in front of the arg $content do? function(&$content) – Scott B Feb 18 '11 at 06:19
  • @Scott: Passes a reference. But I just noticed you need an `return` instead. See edit. – mario Feb 18 '11 at 06:21
  • @mario: WP errors on the & in front of content so I removed it. Now get > Warning: array_keys() expects parameter 1 to be array, string given in C:\xampplite\htdocs\mysite\wp-includes\wp-db.php on line 1222 – Scott B Feb 18 '11 at 06:23
  • @Mario: are you returning content or just setting it to a value in save_rseo_nofollow()? Don't I need to return it replaced? – Scott B Feb 18 '11 at 06:25
  • @Scott: See previous edit. WP still needs the `return`. The array content gets updated right before the return. – mario Feb 18 '11 at 06:26
  • @Mario - got it. +1 Works flawlessly so far. Let me put it through some fire. – Scott B Feb 18 '11 at 06:30
  • @Scott: There's one thing that can be optimized still. The regex should be `<(a\s[^>]+)>` or `a\b[` to not trip over any `` tags etc. – mario Feb 18 '11 at 06:33
  • @Mario - the no_follow_folder may or may not be set. If it's not set, that branch should be ignored. I'm pulling that variable from a settings array declared outside the function. Should I just pass it in as an arg? I'm calling it with add_filter('wp_insert_post_data', 'save_rseo_nofollow' ); – Scott B Feb 18 '11 at 06:37
  • @Scott: See edit for a possible solution. You cannot pull in arguments there because it's a callback. If it is a configuration variable you could use `global $my_folder;` or something alike. Or simply prepend more `elseif` branches for your special cases. – mario Feb 18 '11 at 06:41
  • @Mario - how to code exception when my_folder is empty (not set)? if/then? Also, $my_folder is absolute, not relative as in your ex. Code change? – Scott B Feb 18 '11 at 06:43
  • @Mario - got it. Flawless but for one exception. When my_folder set but get_option('nofollow') is not. Still want to nofollow internal cloaked, but not external in that case. – Scott B Feb 18 '11 at 06:51
  • @Scott: Add more `if` blocks then to keep it readable. My support ends here, I'm not writing the whole plugin for you. – mario Feb 18 '11 at 06:53
10

Try this one (PHP 5.3+). It can:

  • skip links that contain a given address
  • leave a manually set rel attribute untouched

The code:

function nofollow($html, $skip = null) {
    return preg_replace_callback(
        "#(<a[^>]+?)>#is", function ($mach) use ($skip) {
            return (
                !($skip && strpos($mach[1], $skip) !== false) &&
                strpos($mach[1], 'rel=') === false
            ) ? $mach[1] . ' rel="nofollow">' : $mach[0];
        },
        $html
    );
}

Examples:

echo nofollow('<a href="link somewhere" rel="something">something</a>');
// unchanged, because it already contains a rel attribute

echo nofollow('<a href="http://www.cnn.com">something</a>');
// adds rel="nofollow" to the anchor

echo nofollow('<a href="http://localhost">something</a>', 'localhost');
// skipped, because it is an internal link
OzzyCzech
3

Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.

Here's what it can look like:

$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';

$html = '<html><body>
<a href="http://localhost/mytest/">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>
</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$sxe = simplexml_import_dom($dom);

// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
    if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
     && substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
    {
        // skip all links that start with the URL in $blog_url, as long as they
        // don't start with the URL from $my_folder;
        continue;
    }

    if (empty($a['rel']))
    {
        $a['rel'] = 'nofollow';
    }
    else
    {
        $a['rel'] .= ' nofollow';
    }
}

$new_html = $dom->saveHTML();
echo $new_html;

As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the substr() comparisons, for example:

    // change the regexp to your own rules, here we match everything under
    // "http://localhost/mytest/" as long as it's not followed by "go"
    if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
    {
        continue;
    }

Note

I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited to processing a whole page than an HTML fragment. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc.
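One way around this, as a rough sketch: wrap the fragment yourself, then serialize only the children of <body>. The function name nofollow_fragment() and its blanket rule (nofollow every link that has an href but no rel) are placeholders for illustration, not the rules from the question. Passing a node to saveHTML() needs PHP 5.3.6 or later.

```php
<?php
// Sketch: run DOM over an HTML fragment and return only what ends up inside <body>.
// nofollow_fragment() is a hypothetical helper name.
function nofollow_fragment($fragment)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings about loose markup quiet
    $dom->loadHTML('<html><body>' . $fragment . '</body></html>');
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $a) {
        // Placeholder rule: put your own internal/external test here.
        if ($a->hasAttribute('href') && !$a->hasAttribute('rel')) {
            $a->setAttribute('rel', 'nofollow');
        }
    }

    // Serialize each child of <body> so no DOCTYPE, <html> or <body> is emitted.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

The result can then be written back to the post content without the extra wrapper markup.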

Josh Davis
  • Hi, I tried to use your code but it's still adding nofollow to the blog URL. Any help? – Sisir Dec 09 '11 at 21:41
  • This code helped me, but I did run into an encoding issue when the `$html` string contained utf-8 characters such as curly quotes. Replacing `$dom->loadHTML($html);` with `$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));` fixed the issue. Source of fix: [PHP DOMDocument loadHTML not encoding UTF-8 correctly](http://stackoverflow.com/a/8218649/3059883) – Dave Romsey Sep 21 '16 at 20:50
3

Thanks @alex for your nice solution. However, I had a problem with Japanese text, which I fixed as follows. This code can also skip multiple domains via the $whiteList array.

public function addRelNoFollow($html, $whiteList = [])
{
    $dom = new \DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    $a = $dom->getElementsByTagName('a');

    /** @var \DOMElement $anchor */
    foreach ($a as $anchor) {
        $href = $anchor->attributes->getNamedItem('href')->nodeValue;
        $domain = parse_url($href, PHP_URL_HOST);

        // Skip whiteList domains
        if (in_array($domain, $whiteList, true)) {
            continue;
        }

        // Check & get existing rel attribute values
        $noFollow = 'nofollow';
        $rel = $anchor->attributes->getNamedItem('rel');
        if ($rel) {
            $values = explode(' ', $rel->nodeValue);
            if (in_array($noFollow, $values, true)) {
                continue;
            }
            $values[] = $noFollow;
            // separator first: the reversed argument order is not accepted by PHP 8
            $newValue = implode(' ', $values);
        } else {
            $newValue = $noFollow;
        }

        // Create new rel attribute
        $rel = $dom->createAttribute('rel');
        $node = $dom->createTextNode($newValue);
        $rel->appendChild($node);
        $anchor->appendChild($rel);
    }

    // There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
    // They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
    // So we need to do as follows. @see https://stackoverflow.com/a/20675396/1710782
    return $dom->saveHTML($dom->documentElement);
}
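
One thing to watch with the parse_url() check: relative links such as /about or #top have no host, so they are never found in the whitelist and end up with nofollow. A minimal sketch of a stricter test, assuming a hypothetical helper is_external_href() that is not part of the method above:

```php
<?php
// Hypothetical helper: true only for links that point at a host
// which is not in the whitelist.
function is_external_href($href, array $whiteList)
{
    $host = parse_url($href, PHP_URL_HOST);

    // No host: a relative path, a "#fragment", "mailto:" and similar,
    // or a URL that parse_url() could not parse. Treat it as internal.
    if (!is_string($host) || $host === '') {
        return false;
    }

    // Host names are case-insensitive, so compare in lower case.
    return !in_array(strtolower($host), array_map('strtolower', $whiteList), true);
}
```

Inside the loop, `if (!is_external_href($href, $whiteList)) { continue; }` could then take the place of the in_array() check.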
biplob
0
<?php

$str='<a href="http://localhost/mytest/">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>';

function test($x){
  if (preg_match('@localhost/mytest/(?!go/)@i',$x[0])>0) return $x[0];
  return 'rel="nofollow" '.$x[0];
}

echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);

?>
Jimmy Ruska
0

Here is another solution, which has a whitelist option and can add a target="_blank" attribute. It also checks whether a rel attribute already exists before adding a new one.

function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true) 
{
    $Whitelist[] = $_SERVER['HTTP_HOST'];
    foreach ($Whitelist as $Key => $Link) 
    {
        $Host = preg_replace('#^https?://#', '', $Link);
        $Host = "https?://". preg_quote($Host, '#'); // quote for the "#" delimiter used further down
        $Whitelist[$Key] = $Host;
    }

    if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER)) 
    {
        foreach ($matches as $Anchor_Tag) 
        {
            $IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag =  false;
            if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2)) 
            {
                foreach ($All_matches2[1] as $Key => $Attr_Name)
                {
                    if($Attr_Name == 'href')
                    {
                        $Is_Valid_Tag = true;
                        $Url = $All_matches2[2][$Key];
                        // bypass #.. or internal links like "/"
                        if(preg_match('/^\s*[#|\/].*/', $Url)) 
                        {
                            continue 2;
                        }

                        foreach ($Whitelist as $Link) 
                        {
                            if (preg_match("#$Link#", $Url)) {
                                continue 3;
                            }
                        }
                    }
                    else if($Attr_Name == 'rel')
                    {
                        $IS_Rel_Exist = true;
                        $Rel = $All_matches2[2][$Key];
                        preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
                        if( count($match) > 0 )
                        {
                            $IS_Follow_Exist = true;
                        }
                        else
                        {
                            $New_Rel = 'rel="'. $Rel . ' nofollow"';
                        }
                    }
                    else if($Attr_Name == 'target')
                    {
                        $IS_Target_Blank_Exist = true;
                    }
                }
            }

            $New_Anchor_Tag = $Anchor_Tag;
            if(!$IS_Rel_Exist)
            {
                $New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
            }
            else if(!$IS_Follow_Exist)
            {
                $New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
            }

            if($Add_Target_Blank && !$IS_Target_Blank_Exist)
            {
                $New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
            }

            $Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
        }
    }
    return $Content;
}

To use it:

$Page_Content = '<a href="http://localhost/">internal</a>
                 <a href="http://yoursite.com">internal</a>
                 <a href="http://google.com">google</a>
                 <a href="http://example.com" rel="nofollow">example</a>
                 <a href="http://stackoverflow.com" rel="random">stackoverflow</a>';

$Whitelist = ["http://yoursite.com","http://localhost"];

echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
Mohamad Hamouday
0

WordPress solution:

function replace__method($match) {
    list($original, $tag) = $match;   // regex match groups

    $my_folder =  "/articles";       // re-add quirky config here
    $blog_url = 'https://'.$_SERVER['SERVER_NAME'];

    if (strpos($tag, "nofollow")) {
        return $original;
    }
    elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
        return $original;
    }
    else {
        return "<$tag rel='nofollow'>";
    }
}

add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );

function add_nofollow_to_external_links( $content ) {
    $content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
    return $content;
}
Tomerikoo
-1

A script that adds nofollow automatically and keeps the other attributes (it uses str_starts_with(), so it needs PHP 8):

function nofollow(string $html, ?string $baseUrl = null) {
    return preg_replace_callback(
            '#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
                list ($a, $attr, $text) = $mach;
                if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
                    $url = $url[1];
                    if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
                        if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $rel)) {
                            $relAttr = $rel[0];
                            $rel = $rel[1];
                        }
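                        // str_contains() avoids the strpos() pitfall where a match at offset 0 is falsy
                        $rel = 'rel="' . ($rel ? (str_contains($rel, 'nofollow') ? $rel : $rel . ' nofollow') : 'nofollow') . '"';
                        $attr = isset($relAttr) ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
                        // $attr already starts with the whitespace that followed "<a"
                        $a = '<a' . $attr . '>' . $text . '</a>';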
                    }
                }
                return $a;
            },
            $html
    );
}
Redouane