I have a dictionary of swear words in the database, and the following works great

preg_match_all("/\b".$f."(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

$t is the input text and $f = preg_quote("punk"), where "punk" comes from the database dictionary, so at this point in the loop the expression is as follows:

preg_match_all("/\bpunk(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

preg_quote escapes symbols (e.g. # becomes \#) so that the expression stays valid, but when the dictionary word is e.g. "F@CK" or "A$$", it is not detected in the input string with the above expression. I have both a$$ and f@ck in the dictionary, but they do not work. If I remove preg_quote() on the word, the regular expression is invalid because these symbols are not escaped.

Any suggestions on how I can detect "a$$" ???

Edit:

So I guess the expression that is not working as intended would be eg.

preg_match_all("/\bf\@ck(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

Which should find f@ck in $t

UPDATE:

This is my usage, simply put: if there are matches in $m, replace them with "****". This whole block is inside a loop through each word in the dictionary; $f is the dictionary word and $t is the input.

$f = preg_quote($f);
preg_match_all("/\b$f(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);
if (count($m) > 0) {
     $t = preg_replace("/(\b$f(?:ing|er|es|s)?\b)/si","*****",$t);
}

UPDATE: Behold, the var_dump:

preg_quote($f) = string(5) "a\$\$"
$t = string(18) "You're such an a$$"
expression = string(29) "/\ba\$\$(?:ing|er|es|s)?\b/si"

UPDATE: This is only happening when words end with a symbol. I tested "a$$hole" and it’s fine, but "a$$" doesn't work.

ANOTHER UPDATE: Try this simplified version, $words being a make-shift dictionary

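// Single quotes keep "$" literal; in double quotes "a$$hole" would interpolate $hole.
$words = array('a$$','asshole','a$$hole','f@ck','f#ck','f*ck');
$text = 'Input whatever you feel like here eg. a$$';

foreach ($words as $f) {
   $stars = str_repeat("*",strlen($f)); // measure the word before it is escaped
   $f = preg_quote($f,"/");
   $text = preg_replace("/\b".$f."(?:ing|er|es|s)?\b/si",
                         $stars,
                         $text);
}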

I would expect to see "Input whatever you feel like here eg. ***" as a result.

tchrist
Prof
  • Can you include how you are using `preg_quote()` in your example code? –  May 23 '11 at 11:46
  • $f = preg_quote($f); like that :) – Prof May 23 '11 at 11:48
  • Your code works fine for me. Can you show us the string you're testing it on? Or maybe show us the whole dictionary cycle code, maybe the problem isn't in preg_match_all and preg_quote... – Slava May 23 '11 at 12:00
  • 4
    Reminds me of the [Scunthorpe problem](http://en.wikipedia.org/wiki/Scunthorpe_problem). – Gumbo May 23 '11 at 12:12
  • @Gumbo Yeah, not worried about incorrectly finding profanity in eg. assess, that's totally the client's problem, just need it to work :) Besides, the \b makes sure we're talking about full words here – Prof May 23 '11 at 12:14
  • don't forget to set the second param for preg_quote - the delimitating character. In your case, '/' – Aaria Carter-Weir May 23 '11 at 12:16
  • Thanks @skippychalmers, didn't know about that but it hasn't helped :) – Prof May 23 '11 at 12:19
  • 1
    @Prof83 pants. Hmm. Why are you using preg_match and preg_replace? Can you not just use preg_replace and compare strings before and after to determine if anything was matched? – Aaria Carter-Weir May 23 '11 at 12:28
  • Var_dump out this: "/\b$f(?:ing|er|es|s)?\b/si" after you've preg_quote'd $f. – Aaria Carter-Weir May 23 '11 at 12:31
  • ...and var_dump your `$t` too – Slava May 23 '11 at 12:43
  • @Skippy, because the block of code does a lot more :) and I felt it unnecessary to post it all, so pretend it's just preg_replace – Prof May 23 '11 at 12:45
  • This is only happening when words end with a symbol. I tested "a$$hole" and it's fine, but "a$$" doesn't work – Prof May 23 '11 at 13:00
  • 1
    This is not possible. See my answer for why. Ass for "a$$\b" not working, remember that that is asserting that the dollar sign has a word character following it. – tchrist May 23 '11 at 15:53
  • @Prof83 var dump the preg_quote output. – Aaria Carter-Weir May 24 '11 at 17:16
  • 1
    You'd be better off training a bayesian filter to categorise postings as "good" or "bad" based on the words and characters in the posting. Then make it so bad postings don't instantly get posted but require a review. Use of unusual unicode characters would then end up being flagged as likely bad postings. – Matthew Lock Jun 02 '14 at 02:57

3 Answers

191

Cannot Be Done

I'm sorry, but this “problem” is truly impossible to solve. Consider these:

  • ꜰᴜᴄᴋ   is U+A730.1D1C.1D04.1D0B, "\N{LATIN LETTER SMALL CAPITAL F}\N{LATIN LETTER SMALL CAPITAL U}\N{LATIN LETTER SMALL CAPITAL C}\N{LATIN LETTER SMALL CAPITAL K}"
  • ᶠᵘᶜᵏ   is U+1DA0.1D58.1D9C.1D4F, "\N{MODIFIER LETTER SMALL F}\N{MODIFIER LETTER SMALL U}\N{MODIFIER LETTER SMALL C}\N{MODIFIER LETTER SMALL K}"
  • 𝒻𝓊𝒸𝓀   is U+1D4BB.1D4CA.1D4B8.1D4C0, "\N{MATHEMATICAL SCRIPT SMALL F}\N{MATHEMATICAL SCRIPT SMALL U}\N{MATHEMATICAL SCRIPT SMALL C}\N{MATHEMATICAL SCRIPT SMALL K}"
  • 𝖋𝖚𝖈𝖐   is U+1D58B.1D59A.1D588.1D590, "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}\N{MATHEMATICAL BOLD FRAKTUR SMALL U}\N{MATHEMATICAL BOLD FRAKTUR SMALL C}\N{MATHEMATICAL BOLD FRAKTUR SMALL K}"
  • 𝓕𝒰𝒞𝒦   is U+1D4D5.1D4B0.1D49E.1D4A6, "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}\N{MATHEMATICAL SCRIPT CAPITAL U}\N{MATHEMATICAL SCRIPT CAPITAL C}\N{MATHEMATICAL SCRIPT CAPITAL K}"
  • ⓕ ⓤ ⓒ ⓚ   is U+24D5.24E4.24D2.24DA, "\N{CIRCLED LATIN SMALL LETTER F}\N{CIRCLED LATIN SMALL LETTER U}\N{CIRCLED LATIN SMALL LETTER C}\N{CIRCLED LATIN SMALL LETTER K}"
  • Γ̵𐌵ᏟᏦ   is U+393.335.10335.13DF.13E6, "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}\N{GOTHIC LETTER QAIRTHRA}\N{CHEROKEE LETTER TLI}\N{CHEROKEE LETTER TSO}"
  • ƒμɕѤ   is U+192.3BC.255.464, "\N{LATIN SMALL LETTER F WITH HOOK}\N{GREEK SMALL LETTER MU}\N{LATIN SMALL LETTER C WITH CURL}\N{CYRILLIC CAPITAL LETTER IOTIFIED E}"
  • Г̵ЦСК   is U+413.335.426.421.41A, "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}\N{CYRILLIC CAPITAL LETTER TSE}\N{CYRILLIC CAPITAL LETTER ES}\N{CYRILLIC CAPITAL LETTER KA}"
  • ғᵾȼƙ   is U+493.1D7E.23C.199, "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}\N{LATIN SMALL LETTER C WITH STROKE}\N{LATIN SMALL LETTER K WITH HOOK}"
  • ϜυϚΚ   is U+3DC.3C5.3DA.39A, "\N{GREEK LETTER DIGAMMA}\N{GREEK SMALL LETTER UPSILON}\N{GREEK LETTER STIGMA}\N{GREEK CAPITAL LETTER KAPPA}"
  • ЖↃUᆿ   is U+416.2183.55.11BF, "\N{CYRILLIC CAPITAL LETTER ZHE}\N{ROMAN NUMERAL REVERSED ONE HUNDRED}\N{LATIN CAPITAL LETTER U}\N{HANGUL JONGSEONG KHIEUKH}"
  • ʞɔnɟ   is U+29E.254.6E.25F, "\N{LATIN SMALL LETTER TURNED K}\N{LATIN SMALL LETTER OPEN O}\N{LATIN SMALL LETTER N}\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}"

It Gets Worse

And if you think those are easy, just try coping with all of these:

 00 Ↄ ʞ, F ᵾ ⒞ K, K ⓒ Ц ⒡ ,   K , ғ ∞ Ϛ k, f  Ꮯ K, ⓕ oo ɔ ⓚ , ɟ ⒰ ¢ K,   ȼ ,  Ù ȼ ⒦ , f  ⒞ ƙ, F  ᶜ , F ∞  Ж ,  @ Ꮯ , ɟ ᵘ  , F Ц ¢ , f oo Ꮯ ʞ,  oo ¢ Ж ,  υ ᶜ Κ , Ϝ ú * ʞ, ꜰ  c K, ƒ ᵘ ȼ k,  U ȼ , Ж ɔ μ ƒ, F ⓤ ⒞ k, ƒ  C ƙ, ғ 00 ɔ Ѥ, ƒ U c ᴋ,  ∞ Ꮶ ⓒ , ꜰ  ᴄ ⒦ ,  ⒰ Ꮯ Ѥ, ꜰ ᴜ  ⒦ , F   ʞ, f 00  , ғ u С K, f  ɔ Κ , f μ Ↄ K, ɟ  c ʞ, f  Ↄ , F μ ¢ , ᆿ  ᴄ ⒦ , Κ ¢ oo ɟ, ᶠ μ ᶜ Ѥ, ᶠ ⓤ Ꮯ Ж ,  ⒞ ᵘ F, F @ C ⓚ , Ѥ ᴄ u F, ⒡ ᵾ C k, ƒ μ ᶜ ᴋ, F  C , f ᵘ ¢ ᵏ, ᆿ 00  , ꜰ υ ȼ K, Ϝ  ȼ К ,  oo ɕ ᴋ, ғ  Ꮯ ᴋ, ꜰ n  K, ꜰ μ Ϛ К , F ∞ ȼ , ⒡  Ↄ Κ , ƒ  ⒞ , ᶠ U C Ꮶ, ᶠ υ Ↄ ƙ,   C , Ϝ U  Ѥ, Ϝ U Ↄ ,  U ⒞ ᵏ, F @ C К , ғ ᴜ  ᴋ, ⒡ U  К , ɟ U * ᵏ,  Ц c Κ , ғ U Ↄ , ƒ ⒰  ᵏ, ғ  * K,  n  ⓚ , ᶠ 00 С К ,  Ц  k, ƙ c Ц ᶠ,  ⒰ Ѥ , ꜰ ǔ ᴄ ⒦ , F  Ↄ ,   υ ꜰ,   * ᵏ,  00  Ж , Κ C  , ᶠ U С K, ꜰ   Κ , ɟ U ᶜ ⓚ ,  ∞ ȼ ᴋ, ƒ U К ć, ƒ υ ȼ ᴋ, ⒡ ∞ Ж ɕ,  ᵘ  ᵏ, F U Ϛ ʞ, ⓕ   Ж ,    Ↄ, Ϝ n * K,  oo c ⓚ , ƒ U ¢ ʞ, ƒ u C ʞ, K ¢ μ ⒡ , ɟ ⒰ K ɔ, F U c k, F Ц  ⓚ ,  U ᴋ ɔ,   Ꮯ ,    ⓚ , ⓕ  C К , ɟ ᵾ * ⒦ , ᶠ ᵘ ⒞ ⒦ , ƒ ⒰ ᴄ ᵏ, ⒡ ⒰ С K,  ⒰ * ᴋ, ᆿ ∞ ʞ ɕ,  n * Ѥ, Ϝ μ ᴄ , k ć ᵘ ƒ,  ᵘ ɕ , ɟ Ц Ꮶ ᴄ,  ᵾ ⒞ ᵏ, ғ ᵘ  ᵏ,  ᵾ * Ѥ, F  Ꮯ K, ғ ⓤ  ᴋ, ƒ u ɕ , ƙ c ⒰ F,   ⓒ Κ , K ᶜ Ц , ɟ  c ⒦ , ƒ @ c Κ , Ϝ Ц ȼ Ḱ, ⒡ ᵘ  ⒦ , ɟ ᵾ Ѥ ¢, F  Ↄ , Ϝ ᴜ  , Ϝ  ⒞ ,  U Ꮯ ʞ, ƒ υ Ꮯ ᵏ, F ᵾ Ꮯ Κ , Ϝ ᵘ ⓒ ʞ,  ⓤ ᶜ ƙ, ᆿ  ⒞ , f  Ↄ Ѥ,  U  K, Ϝ ᴜ * , ꜰ @ ⓒ ʞ, ƒ u ⓒ , f U ⒞ k,  00 ᴄ Ѥ,  υ С K, F ᴜ ᴄ , ⓕ oo Ↄ ⓚ , ⒡ ᵘ ɕ , ⓕ υ ᴄ Κ , ᆿ U Ꮯ ,   Ꮯ Ꮶ,   Ć ,  Ц ɕ К , f @ Ↄ ⓚ , ᴋ ᶜ U ꜰ,  ᴜ c ⒦ , F ᵘ C ,  00  Ꮶ, ꜰ 00  К , Ϝ  Ϛ ᵏ, F  c Ѥ, ⓕ oo Ↄ K, f ᵾ С ᵏ, ⓕ Ц c ,   c Ж , ⓕ   ƙ, ⓚ C n ғ, ɟ U ȼ ,  00 K ȼ,   ᴄ ,  Ц C ,  Ц ¢ , Ϝ ᵘ c k, ⒡  ¢ k, ƒ ⓤ ⓚ Ↄ,    k, ƒ U Ↄ K,   ᴄ Ꮶ, ᆿ ⓤ  ⒦ , Ж ɔ U , ƒ υ * ᴋ, ƒ   k,  U С ⒦ ,   C Ж , ƒ μ Ꮯ ƙ, ⓕ n ᴄ ⒦ , ⓕ μ ⓒ Ж , ⒡ 00 ɕ ,  ᴜ ᶜ , ᆿ Ù Ж , ⒦ ȼ U , k C ⓤ ᆿ, Ϝ n ȼ ᵏ, ᴋ ȼ ᵾ ɟ, F  ȼ Ѥ, ғ ⒰ ȼ , f U Ж ⒞ , F ῠ  ᵏ, F u  Κ , F 00 ȼ , ꜰ μ Ϛ Ꮶ, ᆿ   K, ⒡ n Ↄ Ж , F @  ƙ, ᶠ ὺ  К ,  U C ᵏ, F U  ⒦ ,  00 Ↄ , ᶠ  c К , ғ ⓤ  ,  ⓤ  Κ ,  U  Ж , ⒡  ɔ Ꮶ, ⓚ ɔ  f,  U C K, F @ C Ѥ, ғ ᴜ С k, ɟ u * ƙ, ⓕ ᵾ ɕ ,  00 ȼ K,  υ  , ƒ ⒰ * ʞ, ⓕ U Ↄ Ж , ꜰ U ȼ ƙ, ⒡ u С ⒦ , ꜰ ᴜ  Ќ, ᆿ μ  ⒦ , ⓕ @ ᴄ К , ᶠ υ ɔ ᵏ, ƙ Ↄ oo ꜰ, 
F ᴜ  ,  ⒰ C ᵏ,  U  ƙ, ƒ ∞ C Ꮶ,  ⒰ * K,  u Ↄ ᴋ, ᆿ U ⓒ , ᆿ U Ꮶ ,  n  , ƒ Ц C ƙ, ⒦   ꜰ, K ¢ ᵘ f,  ⒰  Ꮶ,  ᴄ 00 , Ϝ U  k,  u ¢ ⒦ ,   * Ѥ, ƒ  С ᴋ,   C Ꮶ,  @  Κ , ʞ С  ᶠ,  ᵾ Ϛ Ꮶ, ᶠ ⒰ ɔ , F Ц ⒞ ʞ, ⒡ ⒰ К ɔ, ɟ υ ¢ , Ѥ ȼ U ᆿ,  ᴜ Ↄ ʞ, ғ  * K,   ᴄ ʞ, F   ʞ,  @ ȼ ,  ⒰ * ,  ᵾ ȼ , F  ¢ Ѥ, ꜰ ⓤ ƙ Ϛ, ⓕ 00 c ʞ,  00 Ϛ K,  υ Ↄ Κ , ꜰ μ ⓒ Ж ,  ᵘ Ϛ ʞ, Ϝ ᵘ Ↄ ᵏ, ⒡ ᵾ Ꮯ , Ϝ ⒰ ȼ Ѥ, ƒ n  Ѥ, ᆿ μ ⓒ k,  Ц ɕ Κ , ғ μ  Ѥ, f ⓤ Ꮯ , ᵏ  μ ƒ, ᵏ С  , ᆿ ∞  , ғ ᵘ Ꮯ , ƒ μ Ↄ k, f oo K ȼ, ɟ   С , ꜰ n  K,  00  ᵏ, ᶠ μ ⓒ ,  c ∞ Ϝ, ᆿ Ц Ć ⒦ ,  ᵘ ᴄ , F 00  ⓚ , ᶠ @ ȼ К , ...

And that’s not all: there are at least a bazingatillion more where those came from. Do you see now why this fundamentally cannot be done?

Full Disclosure

Because I don't believe in security through obscurity, here's the program that generates all those:

#!/usr/bin/env perl
#
# unifuck - print infinite permutations of fuck in unicode aliases
#
# Tom Christiansen <tchrist@perl.com>
# Mon May 23 09:37:27 MDT 2011

use strict;
use warnings;
use charnames ":full";

use Unicode::Normalize;

binmode(STDOUT, ":utf8");

our(@diddle, @fuck, %fuck); # initted down below
while (my($f,$u,$c,$k) = splice(@fuck, 0, 4)) {
    $fuck{F}{$f}++;
    $fuck{U}{$u}++;
    $fuck{C}{$c}++;
    $fuck{K}{$k}++;
} 

my @F = keys %{ $fuck{F} };
my @U = keys %{ $fuck{U} };
my @C = keys %{ $fuck{C} };
my @K = keys %{ $fuck{K} };

while (1) { 
    my $f = $F[rand @F];
    my $u = $U[rand @U];
    my $c = $C[rand @C];
    my $k = $K[rand @K];

    for ($f,$u,$c,$k) {  
        next if length > 1;
        next if /\p{EA=W}/;
        next if /\pM/;
        next if /\p{InEnclosedAlphanumerics}/;
        s/$/$diddle[rand @diddle]/          if rand(100) < 15;
        s/$/\N{COMBINING ENCLOSING KEYCAP}/ if rand(100) <  1;
    }

    if    (             0) {                                       }
    elsif (rand(100) <  5) {     $u        = q(@)                  } 
    elsif (rand(100) <  5) {        $c     = q(*)                  } 
    elsif (rand(100) < 10) {       ($c,$k) = ($k,$c)               } 
    elsif (rand(100) < 15) { ($f,$u,$c,$k) = reverse ($f,$u,$c,$k) }

    print NFC("$f $u $c $k\n");
}

BEGIN {

    # ok to have repeats in each position, since they'll be counted only once
    # per unique strings
    @fuck = (

        "\N{LATIN CAPITAL LETTER F}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{LATIN CAPITAL LETTER C}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER U}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{INFINITY}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER O}\N{LATIN SMALL LETTER O}",
        "\N{LATIN SMALL LETTER C}",
        "\N{KELVIN SIGN}",

        "\N{LATIN SMALL LETTER F}",
        "\N{DIGIT ZERO}\N{DIGIT ZERO}",
        "\N{CENT SIGN}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN LETTER SMALL CAPITAL F}",
        "\N{LATIN LETTER SMALL CAPITAL U}",
        "\N{LATIN LETTER SMALL CAPITAL C}",
        "\N{LATIN LETTER SMALL CAPITAL K}",

        "\N{MODIFIER LETTER SMALL F}",
        "\N{MODIFIER LETTER SMALL U}",
        "\N{MODIFIER LETTER SMALL C}",
        "\N{MODIFIER LETTER SMALL K}",

        "\N{MATHEMATICAL SCRIPT SMALL F}",
        "\N{MATHEMATICAL SCRIPT SMALL U}",
        "\N{MATHEMATICAL SCRIPT SMALL C}",
        "\N{MATHEMATICAL SCRIPT SMALL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL K}",

        "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}",
        "\N{MATHEMATICAL SCRIPT CAPITAL U}",
        "\N{MATHEMATICAL SCRIPT CAPITAL C}",
        "\N{MATHEMATICAL SCRIPT CAPITAL K}",

        "\N{CIRCLED LATIN SMALL LETTER F}",
        "\N{CIRCLED LATIN SMALL LETTER U}",
        "\N{CIRCLED LATIN SMALL LETTER C}",
        "\N{CIRCLED LATIN SMALL LETTER K}",

        "\N{PARENTHESIZED LATIN SMALL LETTER F}",
        "\N{PARENTHESIZED LATIN SMALL LETTER U}",
        "\N{PARENTHESIZED LATIN SMALL LETTER C}",
        "\N{PARENTHESIZED LATIN SMALL LETTER K}",

        "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{GOTHIC LETTER QAIRTHRA}",
        "\N{CHEROKEE LETTER TLI}",
        "\N{CHEROKEE LETTER TSO}",

        "\N{LATIN SMALL LETTER F WITH HOOK}",
        "\N{GREEK SMALL LETTER MU}",
        "\N{LATIN SMALL LETTER C WITH CURL}",
        "\N{CYRILLIC CAPITAL LETTER IOTIFIED E}",

        "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{CYRILLIC CAPITAL LETTER TSE}",
        "\N{CYRILLIC CAPITAL LETTER ES}",
        "\N{CYRILLIC CAPITAL LETTER KA}",

        "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}",
        "\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}",
        "\N{LATIN SMALL LETTER C WITH STROKE}",
        "\N{LATIN SMALL LETTER K WITH HOOK}",

        "\N{GREEK LETTER DIGAMMA}",
        "\N{GREEK SMALL LETTER UPSILON}",
        "\N{GREEK LETTER STIGMA}",
        "\N{GREEK CAPITAL LETTER KAPPA}",

        "\N{HANGUL JONGSEONG KHIEUKH}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{ROMAN NUMERAL REVERSED ONE HUNDRED}",
        "\N{CYRILLIC CAPITAL LETTER ZHE}",

        "\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}",
        "\N{LATIN SMALL LETTER N}",
        "\N{LATIN SMALL LETTER OPEN O}",
        "\N{LATIN SMALL LETTER TURNED K}",

        "\N{FULLWIDTH LATIN CAPITAL LETTER F}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER U}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER C}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER K}",

    );

    @diddle = (
        "\N{COMBINING GRAVE ACCENT}",
        "\N{COMBINING ACUTE ACCENT}",
        "\N{COMBINING CIRCUMFLEX ACCENT}",
        "\N{COMBINING TILDE}",
        "\N{COMBINING BREVE}",
        "\N{COMBINING DOT ABOVE}",
        "\N{COMBINING DIAERESIS}",
        "\N{COMBINING CARON}",
        "\N{COMBINING CANDRABINDU}",
        "\N{COMBINING INVERTED BREVE}",
        "\N{COMBINING GRAVE TONE MARK}",
        "\N{COMBINING ACUTE TONE MARK}",
        "\N{COMBINING GREEK PERISPOMENI}",
        "\N{COMBINING FERMATA}",
        "\N{COMBINING SUSPENSION MARK}",
    );

}
tchrist
  • 42
    I remember using all sorts of Unicode tricks to get around profanity filters a few years ago, and getting banned anyway. Good times. – BoltClock Oct 17 '11 at 22:06
  • 16
    Now, if only SO's gods would read and understand this answer, and stop the stupid censoring. – sbi Mar 30 '12 at 07:11
  • 3
    @sbi well, until then, we can still use our cyrillic letters when our problemmas really need sоlving. Here, take one: рҏѓґоьꙑӏеҽҿӗӎ; neither of р,о,е looks any different from their latin homoglyphs. – John Dvorak Jun 30 '13 at 06:38
  • 13
    I don't give a... You already give all of them. – nanofarad Oct 28 '13 at 22:37
  • Even in ascii you can substitute "ph" for "f" etc. – Matthew Lock Jun 02 '14 at 03:00
  • Just so it's said, someone who considers profanity to be that harmful (and has the time and patience -- and/or a big enough army of censors -- to examine every one of the million or so possible Unicode code points and billions of combinations of accents) could conceivably build a list of characters that each character "looks like" and/or "sounds like" or "masks". It'd be possible to eliminate every one of the variants listed here. It'd be outrageously tedious, though, and not worth it unless/until certain collections of letters are truly threatening civilization as we know it. – cHao Jun 02 '14 at 15:06
  • 3
    @cHao Actually, this is not as hard as you think, given the existence of *confusables.txt, confusablesSummary.txt,* and *confusablesWholeScript.txt* from [Unicode Technical Report #36: “Unicode Security Considerations”](http://www.unicode.org/reports/tr36/). – tchrist Jun 02 '14 at 15:25
  • @tchrist: Those lists do help with very-similar (combinations of )?characters, but that's a tiny subset of the problem a profanity filter would have to deal with. They won't help much with `ⓕ⒰cʞ`, for example, considering the characters aren't exactly "confusable". (A quick glance fails to find `ⓕ` and `ʞ`, for example, probably because the likelihood of someone legitimately misreading one as the other is slim.) After replacement, you'd end up with `ⓕ(u)cʞ`, which would likely get past any filter that doesn't either do additional translation or arbitrarily block everything outside of ASCII. – cHao Jun 02 '14 at 19:10
  • actually it **can be done**, I know, because I did it, 7 years before this question was written. Not in PHP of course, and it is more involved than a regex, but it is a real-time algorithm that was 100% effective for the words it was trained to exclude, all alternative phonetic spellings were caught. –  Apr 23 '18 at 16:02
4

\b checks for a word boundary. According to http://www.regular-expressions.info/wordboundaries.html:

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

"Word characters" are letters, digits, and underscores, so in the string "a$$", the word boundary occurs after the "a", not after the second "$".

You will probably need to explicitly specify the characters you consider to be "word boundaries" by using a class (e.g., [- '"]).
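
The boundary rules quoted above can be checked directly. The following is a minimal sketch, not the asker's real code: $text and $word are illustrative values, and the lookaround pattern is one possible alternative to listing boundary characters in a class. It assumes a "word" should simply not touch another word character on either side.

```php
<?php
// Illustrative input; single quotes keep "$" literal in PHP strings.
$text = 'You are such an a$$';
$word = preg_quote('a$$', '/');   // yields a\$\$

// Trailing \b needs a word character on exactly one side of the position.
// After the final "$" there is only the end of the string, so nothing matches.
$withBoundary = preg_match('/\b' . $word . '\b/i', $text);

// Lookarounds only require that no word character is adjacent,
// so the end of the string (or a space, comma, etc.) is acceptable.
$withLookaround = preg_match('/(?<!\w)' . $word . '(?!\w)/i', $text);

var_dump($withBoundary, $withLookaround);
```

With this pattern "a$$hole" in the input would not be reported as a match for "a$$", because the "h" that follows is a word character.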

  • I need to provide you a better output result, its still not working despite how promising your answer sounds, is there a way to get you the dictionary and class i am working with? – Prof May 23 '11 at 14:46
  • Add a snipt.org URL to your OP. –  May 23 '11 at 15:00
  • 3
    He may improve his patterns, but he'll never solve this problem. It cannot be done: see my answer for why. – tchrist May 23 '11 at 16:06
2

Now that you've said it doesn't work at the end of a word, I see the problem. $, @ and other such special characters aren't word characters, so \b breaks the word after 'a' in 'a$$' when no letter follows it in the input string. I suggest using [^a-z] to mark the end of the word to fix it.

preg_match_all("/\b".$f."(?:ing|er|es|s)?[^a-z]/si",$t,$m,PREG_SET_ORDER);
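
One caveat with the expression above: [^a-z] has to consume a character, so a word sitting at the very end of the input still fails to match, and in a replacement the following character would be swallowed too. A hedged variant, assuming the goal is only "no letter directly before or after", is to use lookarounds instead. The dictionary and input below are made up for illustration.

```php
<?php
// Hypothetical mini-dictionary and input; single quotes keep "$" literal.
$words = array('a$$', 'f@ck');
$text  = 'eg. a$$, f@cking and a$$';

foreach ($words as $word) {
    $quoted  = preg_quote($word, '/');
    // (?<![a-z]) and (?![a-z]) assert "no letter here" without consuming anything,
    // so they also succeed at the start and end of the string.
    $pattern = '/(?<![a-z])' . $quoted . '(?:ing|er|es|s)?(?![a-z])/i';

    // Mask each match with as many stars as the matched text has bytes.
    $text = preg_replace_callback(
        $pattern,
        function ($m) {
            return str_repeat('*', strlen($m[0]));
        },
        $text
    );
}

echo $text, "\n";
```

This only addresses the end-of-word matching problem; it does nothing about spellings that are not in the dictionary.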
Slava
  • I need to provide you a better output result, its still not working despite how promising your answer sounds, is there a way to get you the dictionary and class i am working with? – Prof May 23 '11 at 14:46
  • It's easy to give bazingatillions of strings that will sneak past this approach. It is doomed to fail. – tchrist May 23 '11 at 16:07
  • Ok but hangon, you're telling me it's impossible to preg_replace("a$$","***","you a$$"); ??? That doesnt sound right to me, i am not trying to find characters that resemble an "S", i am trying to run off a given set of words in the dictionary, if someone posts "a##hole" and its not in the dictionary, then we'll add it into the dictionary??? – Prof May 23 '11 at 16:36