1

I'm attempting to create a bad word filter in PHP that will search a text, match against an array of known bad words, then replace each character (except the first letter) in the bad word with an asterisk.

Example:

  • fook would become f***
  • shoot would become s****

The only part I don't know is how to keep the first letter in the string, and how to replace the remaining letters with something else while keeping the same string length.

My code is unsuitable because it always replaces the whole word with exactly 3 asterisks.

$string = preg_replace("/\b(". $word .")\b/i", "***", $string);
miken32
mwieczorek
  • Depending on the size of the word list, str_replace() with arrays would be faster –  Feb 17 '11 at 01:22

5 Answers

3
$string = 'fook would become';
$word = 'fook';

$string = preg_replace("~\b". preg_quote($word, '~') ."\b~i", $word[0] . str_repeat('*', strlen($word) - 1), $string);

var_dump($string);
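
To cover a whole list of words, the same replacement can be run once per entry. This is only a minimal sketch, assuming PHP 7+; the helper name mask_bad_words() and the sample list are hypothetical, not part of the snippet above.

```php
<?php
// Hypothetical helper: runs the single-word replacement once per blacklist entry.
function mask_bad_words(string $text, array $badWords): string
{
    foreach ($badWords as $word) {
        if ($word === '') {
            continue; // an empty entry would yield a pattern that matches at every word boundary
        }
        // Escape the word for use inside a ~-delimited, case-insensitive pattern.
        $pattern = '~\b' . preg_quote($word, '~') . '\b~i';
        // First letter as spelled in the list, then one asterisk per remaining byte.
        $masked = $word[0] . str_repeat('*', strlen($word) - 1);
        // preg_replace() returns null on error; keep the text unchanged in that case.
        $text = preg_replace($pattern, $masked, $text) ?? $text;
    }
    return $text;
}

echo mask_bad_words('fook would become, shoot would become', ['fook', 'shoot']);
// f*** would become, s**** would become
```

Note that the kept first letter comes from the list entry rather than from the text, so "Fook" in the input comes out as "f***".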
zerkms
  • This solution is not flexible enough to readily receive a blacklist array of bad words. It is hardcoded to only replace a single word. – mickmackusa May 29 '23 at 07:16
1

This can be done in many ways, some of them involving very weird auto-generated regexps, but I believe preg_replace_callback() would end up being more robust:

<?php
# as already pointed out, your words *may* need sanitization

foreach($words as $k=>$v)
  $words[$k]=preg_quote($v,'/');

# and to be collapsed into a **big regexpy goodness**
$words=implode('|',$words);


# after that, a single preg_replace_callback() would do

$string = preg_replace_callback('/\b('. $words .')\b/i', "my_beloved_callback", $string);

function my_beloved_callback($m)
{
  $len=strlen($m[1])-1;

  return $m[1][0].str_repeat('*',$len);
}
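
For reference, here is a self-contained run of the same idea with a closure instead of a named callback. It is a sketch only: the word list and the input string are made up for illustration, and it assumes single-byte (ASCII) words, since strlen() counts bytes.

```php
<?php
// Hypothetical blacklist and input, for illustration only.
$words  = ['fook', 'shoot'];
$string = 'Fook that, shoot first; shooting is not matched.';

// Escape each word for a /-delimited pattern, then join the words with |.
$alternation = implode('|', array_map(
    function ($w) {
        return preg_quote($w, '/');
    },
    $words
));

$masked = preg_replace_callback(
    '/\b(?:' . $alternation . ')\b/i',
    function (array $m): string {
        // $m[0] is the word as it appeared in the text, so its casing is kept.
        return $m[0][0] . str_repeat('*', strlen($m[0]) - 1);
    },
    $string
);

echo $masked;
// F*** that, s**** first; shooting is not matched.
```

For multibyte words, mb_substr() and mb_strlen() together with the u modifier would be the natural replacements for the byte-based calls used here.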
ZJR
0
$string = preg_replace("/\b(".$word[0].")".substr($word, 1)."\b/i", '$1'.str_repeat('*', strlen($word) - 1), $string);
rik
0

Assuming every entry in your blacklist consists entirely of letters, or at least of word characters (letters, digits, and underscores), you won't need to call preg_quote() before imploding the words and inserting them into the regex pattern.

Use the \G metacharacter to continue matching after the first letter of a qualifying word is matched. Every subsequently matched letter in the bad word will be replaced 1-for-1 with an asterisk.

\K is used to forget/release the first letter of the bad word.

This approach removes the need to call preg_replace_callback() to measure every matched string and write N asterisks after the first letter of every matched bad word in a block of text.

Breakdown:

/                      #start of pattern delimiter
(?:                    #non-capturing group to encapsulate logic
   \b                  #position separating word character and non-word character
   (?=                 #start lookahead -- to match without consuming letters
      (?:fook|shoot)   #OR-delimited bad words
      \b               #position separating word character and non-word character
   )                   #end lookahead
   \w                  #first word character of bad word
   \K                  #forget first matched word character
   |                   #OR -- to set up \G technique
   \G(?!^)             #continue matching from previous match but not from the start of the string
)                      #end of non-capturing group
\w                     #match non-first letter of bad word
/                      #ending pattern delimiter
i                      #make pattern case-insensitive

Code:

$bad = ['fook', 'shoot'];
$pattern = '/(?:\b(?=(?:' . implode('|', $bad) . ')\b)\w\K|\G(?!^))\w/i';

echo preg_replace($pattern, '*', 'Holy fook n shoot, Batman; The Joker\'s shooting The Riddler!');
// Holy f*** n s****, Batman; The Joker's shooting The Riddler!
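
If the list comes from somewhere you don't control, the word-characters-only assumption can be enforced before building the pattern. A rough sketch, assuming PHP 7+; the function name mask_with_continue_anchor() is hypothetical and the pattern is the same one as above.

```php
<?php
// Hypothetical wrapper around the \G pattern shown above.
function mask_with_continue_anchor(string $text, array $bad): string
{
    // Keep only entries made entirely of word characters, which is the
    // assumption that makes skipping preg_quote() safe.
    $bad = array_filter($bad, function ($w) {
        return is_string($w) && preg_match('/\A\w+\z/', $w) === 1;
    });

    if (!$bad) {
        return $text; // nothing usable in the list, so leave the text alone
    }

    $pattern = '/(?:\b(?=(?:' . implode('|', $bad) . ')\b)\w\K|\G(?!^))\w/i';

    // preg_replace() returns null on error; fall back to the original text.
    return preg_replace($pattern, '*', $text) ?? $text;
}

echo mask_with_continue_anchor('Holy fook n shoot, Batman', ['fook', 'shoot']);
// Holy f*** n s****, Batman
```

Entries containing anything other than word characters are silently dropped in this sketch; whether to drop them, escape them, or raise an error is a design choice left to the caller.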
mickmackusa
-1

Here is a Unicode-friendly regular expression for PHP. It should give you an idea of how to proceed.

function do_something_except_first_letter($s) {
    // the following pattern skips the first character of each word and passes
    // every other word character to the callback; the first letter is kept
    // even for words inside quotes and brackets.
    // alternative regex is '/(?<!^|\s|\W)(\w)/u'.
    return preg_replace_callback('/(\B\w)/u', function($m) {
            // do what you need...
            // for example, lowercase all characters except the first letter
            return mb_strtolower($m[1]); 
        }, $s);
}
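
Applied to the masking problem from the question, the same \B\w idea might look like the sketch below. The function name and the hardcoded fook|shoot list are hypothetical, and PHP 7+ is assumed.

```php
<?php
// Hypothetical helper: replaces every word character that is not at the
// start of a word with an asterisk.
function mask_all_but_first_letter(string $word): string
{
    // \B\w matches a word character preceded by another word character.
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back.
    return preg_replace('/\B\w/u', '*', $word) ?? $word;
}

// Only mask words from a (hypothetical) blacklist, leaving other text alone.
$text = preg_replace_callback(
    '/\b(?:fook|shoot)\b/iu',
    function (array $m): string {
        return mask_all_but_first_letter($m[0]);
    },
    'Holy fook n shoot, Batman'
);

echo $text;
// Holy f*** n s****, Batman
```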
  • This is (at best) the correct answer to another question. This snippet looks like a reinvention of `mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');` Probably more appropriately posted at [Make all words lowercase and the first letter of each word uppercase](https://stackoverflow.com/q/32564539/2943403) – mickmackusa May 29 '23 at 07:10
  • Furthermore, this answer could be reduced to `/\B\w/u` then access `$m[0]`. – mickmackusa May 29 '23 at 07:30
  • It's a correct answer to this question, because I just pointed out the idea, the regular expression itself. However, \B is a nice addition, thank you; I edited my answer – Aleksey Kuznetsov May 29 '23 at 15:15
  • There is no benefit in using a capturing group. – mickmackusa Jun 01 '23 at 07:30