
I have a simple message board, let's say mywebsite.com, that allows users to post messages. Currently the board makes all links clickable, i.e. when someone posts something that starts with:

http://, https://, www., http://www., https://www.

then the script automatically turns it into a link (i.e. wraps it in an A href tag).

THE PROBLEM: there is too much spam. My idea is to automatically remove the above http(s)/www prefixes so that these don't become clickable links. HOWEVER, I want to allow posters to link to pages within my site, i.e. not to remove the prefixes when the message contains links to mywebsite.com.

My idea was to create two arrays:

$removeParts = array('http://', 'https://', 'www.', 'http://www.', 'https://www.');

$keepParts = array('http://mywebsite.com', 'http://www.mywebsite.com', 'www.mywebsite.com', 'https://www.mywebsite.com', 'https://mywebsite.com');

but I don't know how to use them correctly (str_replace could probably work somehow).

Below is an example of $message before and after posting:

$message BEFORE:

Hello world, thanks to http://mywebsite.com/about I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2.

$message AFTER:

Hello world, thanks to http://mywebsite.com/about I learned a lot. I found you on bing.com, google.com/search and on some spamwebsite.com/refid=spammer2.


Please note the user enters plain text into the post form, so the script should only work with this plain text (no A href tags etc.).
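The two-array str_replace idea can be folded into a single regex pass: find each URL-like token, and strip the prefix only when the host is not mywebsite.com. A minimal sketch of that approach (the function name and regex are my own illustration, not from the question):

```php
<?php
// Sketch: strip http(s):// and www. from links, except links to my own site.
// Function name and regex are illustrative assumptions.
function stripForeignPrefixes($message, $myHost = 'mywebsite.com')
{
    // Optional scheme, optional www., then a host with a TLD and an optional path.
    $pattern = '~\b(?:https?://)?(?:www\.)?([a-z0-9.-]+\.[a-z]{2,})(/[^\s]*[^\s.,!?])?~i';
    return preg_replace_callback($pattern, function ($m) use ($myHost) {
        $host = strtolower($m[1]);
        $path = isset($m[2]) ? $m[2] : '';
        // Links to my own site keep their clickable prefix untouched.
        if ($host === $myHost) {
            return $m[0];
        }
        // Everything else loses the scheme and www., so it won't auto-link.
        return $host . $path;
    }, $message);
}

echo stripForeignPrefixes(
    'Hello world, thanks to http://mywebsite.com/about I learned a lot. '
  . 'I found you on http://www.bing.com, https://google.com/search '
  . 'and on some www.spamwebsite.com/refid=spammer2.'
);
```

This reproduces the BEFORE/AFTER example above: the mywebsite.com link survives intact while the other URLs are reduced to bare host/path text.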

userlond
NonCoder
  • Check out this post: http://stackoverflow.com/questions/9364242/how-to-remove-http-www-and-slash-from-url-in-php – nomistic Apr 24 '15 at 23:29
  • Yes, I know how to parse domain from URL, but here a message may contain both regular text and link/s... not just a link. – NonCoder Apr 24 '15 at 23:32
  • Note: the accepted answer on that link provides an answer to that question as well. – nomistic Apr 24 '15 at 23:33

4 Answers

$url = "http://mywebsite/about";
$parse = parse_url($url);

if($parse["host"] == "mywebsite")
    echo "My site, let's mark it as link";

More info: http://php.net/manual/en/function.parse-url.php
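To apply this per-URL check to a whole message, one option (my own sketch, not part of the answer) is a small helper that normalizes each candidate URL first, since parse_url() only fills in "host" when the URL carries a scheme:

```php
<?php
// Sketch: decide whether a URL points at my own site using parse_url().
// Helper name and the mywebsite.com host are illustrative assumptions.
function isMySite($url, $myHost = 'mywebsite.com')
{
    // parse_url() only reports a "host" when the URL has a scheme,
    // so bare "www.example.com" links need one prepended first.
    if (stripos($url, 'http://') !== 0 && stripos($url, 'https://') !== 0) {
        $url = 'http://' . $url;
    }
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    return $host === $myHost || $host === 'www.' . $myHost;
}
```

Combined with the question's $removeParts list, the prefixes can then be stripped only from URLs where isMySite() returns false.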

Ido

killSpam() function features:

  • Works with single and double quotes.
  • Invalid HTML
  • ftp://
  • http://
  • https://
  • file://
  • mailto:

function killSpam($html, $whitelist){

    // Process HTML links (<a href="...">text</a>), including sloppy markup.
    preg_match_all('%(<(?:\s+)?a.*?href=["|\'](.*?)["|\'].*?>(.*?)<(?:\s+)?/(?:\s+)?a(?:\s+)?>)%sm', $html, $match, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($match[1]); $i++) {
        // $match[2] holds the href; replace the whole tag if it isn't whitelisted.
        if (!preg_match("/$whitelist/", $match[2][$i])) {
            // Pass the "%" delimiter to preg_quote(), or URLs containing "%" break the pattern.
            $html = preg_replace("%" . preg_quote($match[1][$i], "%") . "%", " (SPAM) ", $html);
        }
    }

    // Process clear-text links (http, https, ftp, file, mailto, bare www./ftp.).
    preg_match_all('/(\b(?:(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[A-Z0-9+&@#\/%?=~_|$!:,.;-]*[A-Z0-9+&@#\/%=~_|$-]|((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,6})\b)|"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[^"\r\n]+"|\'(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[^\'\r\n]+\')/i', $html, $match2, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($match2[1]); $i++) {
        if (!preg_match("/$whitelist/", $match2[1][$i])) {
            $html = preg_replace("%" . preg_quote($match2[1][$i], "%") . "%", " (SPAM) ", $html);
        }
    }

    return $html;
}

Usage:

$html = <<< LOB
 <p>Hello world, thanks to <a href="http://mywebsite.com/about" rel="nofollow">http://mywebsite/about</a> I learned a lot. I found
  you on <a href="http://www.bing.com" rel="nofollow">http://www.bing.com</a>, <a href="https://google.com/search" rel="nofollow">https://google.com/search</a> and on some <a href="http://www.spamwebsite.com" rel="nofollow">www.spamwebsite.com/refid=spammer2< /a >. www.spamme.com, http://morespam.com/?aff=122, http://crazyspammer.com/?money=22 and spam@email.com, file://spamfile.com/file.txt ftp://spamftp.com/file.exe </p>
LOB;

$whitelist = "(google\.com|yahoo\.com|bing\.com|nicesite\.com|mywebsite\.com)";

$noSpam = killSpam($html, $whitelist);

echo $noSpam;

Spam Example:

I cannot post the spam HTML here (I guess SO has its own killSpam()). View it at http://pastebin.com/HXCkFeGn

Hello world, thanks to http://mywebsite/about I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2. www.spamme.com, http://morespam.com/?aff=122, http://crazyspammer.com/?money=22 and spam@email.com, file://spamfile.com/file.txt ftp://spamftp.com/file.exe


Output:

Hello world, thanks to (SPAM) I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some (SPAM) . (SPAM) , (SPAM) , (SPAM) and (SPAM) , (SPAM) (SPAM)


Demo:

http://ideone.com/9IxFrB

Pedro Lobito
  • Thanks, but please note that the input is clear, ie. user doesn't enter a href etc. so in your example the initial $html is: $html='Hello world, thanks to http://mywebsite/about I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2.'; Would it work with this too? – NonCoder Apr 24 '15 at 23:43
  • You need to create a white list. I'll update the code. – Pedro Lobito Apr 24 '15 at 23:49
  • I think it's better in reverse, ie. my $removeParts / $keepParts could be considered as whitelisted site, that should be easier I hope.. – NonCoder Apr 24 '15 at 23:51
  • Updated 2: matches incorrect html. i.e.: `< \ a >`, `< a ` – Pedro Lobito Apr 25 '15 at 00:06
  • Ok, but the input does not contain a href tags (your code works with a href, rel=nofollow etc). User enters clear text and http/https/www parts should be removed from this clear text only... – NonCoder Apr 25 '15 at 00:09
  • Thanks, but my idea would be to: 1 - find all occurrences of http/www and add them to an array. 2 - remove URLs related to mywebsite.com from the array. 3 - use str_replace to remove http/www from the URLs not belonging to mywebsite.com. Your solution seems too complicated.. : ) – NonCoder Apr 25 '15 at 01:30
  • I guess you can go YesCoder for a bit:) – Pedro Lobito Apr 25 '15 at 01:33

If you want to preserve the text of the links but make them "not clickable", you may try this code:

<?php

$text = <<<__text
   Hello world, thanks to http://mywebsite/about I learned a lot.
   I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2.
   www.spamme.com, http://morespam.com/?aff=122, http://crazyspammer.com/?money=22 and spam@email.com, file://spamfile.com/file.txt ftp://spamftp.com/file.exe
__text;
$allowed_domains = ['mywebsite.com'];

$pattern = "/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-_]*)?\??(?:[\-\+=&;%@\.\w_]*)#?(?:[\.\!\/\\\w]*))?)/";
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    list(, $url, $scheme_and_domain, $scheme, $path) = $m;
    $domain = preg_replace(['/^' . preg_quote($scheme, '/') . '/i', '/^www\./i'], '', $scheme_and_domain);

    if (in_array($domain, $allowed_domains)) continue;

    $url_prepared = rtrim("$domain$path", '/');
    $text = str_replace($url, $url_prepared, $text);
}

echo $text;

Codepad

userlond

For anyone looking for an answer - I posted a related (more specific) question which solved the problem: PHP - remove words (http|https|www|.com|.net) from string that do not start with specific words

NonCoder