Remove special characters that mess with formatting

Question

I am currently building a chat and can't find a way to stop users from posting special characters that break the chat's formatting and lag other users out of the chat.

I am basically trying to remove them entirely. I know that my current code, even if it worked, would only replace them; I was just trying to get something working first.

Here is the code that I am using to censor/scrub the message. I thought htmlentities() would do it, but it does not seem to work properly.

    $message = $censor->censorString(
        $this->parseUrls(
            htmlentities(
                strip_tags($message)
            )
        )
    ); // Strip $message of profanity, HTML tags, and special characters

Here is a screenshot of my problem:

`htmlspecialchars` converts special characters to their HTML entities. That doesn't remove them. — GolezTrol, Aug 14 '15 at 23:49
Was testing with that, updated my code with updated problem. Thank you, forgot to change it back. — Andrew Rayner, Aug 14 '15 at 23:50
It's a bit unclear what you want to do. Do you want to escape those characters, remove them, replace them with something else? Calling `htmlspecialchars` twice will show the (escaped) html entity instead of those characters.. — GolezTrol, Aug 14 '15 at 23:52
I want to remove them. Sorry I did not specify. I will add that to my post. — Andrew Rayner, Aug 14 '15 at 23:53
You could use a regex for that. Something like `preg_replace('~[^\w]~', ''` depending on your `special character` requirement. Demo: https://eval.in/417007 `\w` is `a-z`, `A-Z`, `_`, and `0-9`. — chris85, Aug 15 '15 at 00:00

score 4 · Accepted Answer · answered Oct 06 '15 at 13:14

Contrary to many answers you'll find on StackOverflow, it is trivial to sanitize "Zalgo" text with a regex engine that supports matching on Unicode categories. PHP's preg_* functions use the PCRE library. If PCRE is compiled with --enable-unicode-properties, you can strip all Unicode combining marks using:

$sanitized = preg_replace('/\pM/u', '', $zalgo);

Or allow a certain maximum of consecutive combining marks, say one:

$sanitized = preg_replace('/(\pM)\pM+/u', '\1', $zalgo);

Or two:

$sanitized = preg_replace('/(\pM{2})\pM+/u', '\1', $zalgo);

This will turn Zalgo text like

T̫̺̳o̬̜ ì̬͎̲̟nv̖̗̻̣̹̕o͖̗̠̜̤k͍͚̹͖̼e̦̗̪͍̪͍ ̬ͅt̕h̠͙̮͕͓e̱̜̗͙̭ ̥͔̫͙̪͍̣͝ḥi̼̦͈̼v҉̩̟͚̞͎e͈̟̻͙̦̤-m̷̘̝̱í͚̞̦̳n̝̲̯̙̮͞d̴̺̦͕̫ ̗̭̘͎͖r̞͎̜̜͖͎̫͢ep͇r̝̯̝͖͉͎̺e̴s̥e̵̖̳͉͍̩̗n̢͓̪͕̜̰̠̦t̺̞̰i͟n҉̮̦̖̟g̮͍̱̻͍̜̳ ̳c̖̮̙̣̰̠̩h̷̗͍̖͙̭͇͈a̧͎̯̹̲̺̫ó̭̞̜̣̯͕s̶̤̮̩̘.̨̻̪̖͔ ̳̭̦̭̭̦̞́I̠͍̮n͇̹̪̬v̴͖̭̗̖o̸k҉̬̤͓͚̠͍i͜n̛̩̹͉̘̹g͙ ̠̥ͅt̰͖͞h̫̼̪e̟̩̝ ̭̠̲̫͔fe̤͇̝̱e͖̮̠̹̭͖͕l͖̲̘͖̠̪i̢̖͎̮̗̯͓̩n̸̰g̙̱̘̗͚̬ͅ ͍o͍͍̩̮͢f̖͓̦̥ ̘͘c̵̫̱̗͚͓̦h͝a̝͍͍̳̣͖͉o͙̟s̤̞.̙̝̭̣̳̼͟ ̢̻͖͓̬̞̰̦W̮̲̝̼̩̝͖i͖͖͡ͅt̘̯͘h̷̬̖̞̙̰̭̳ ̭̪̕o̥̤̺̝̼̰̯͟ṳ̞̭̤t̨͚̥̗ ̟̺̫̩̤̳̩o̟̰̩̖ͅr̞̘̫̩̼d̡͍̬͎̪̺͚͔e͓͖̝̙r̰͖̲̲̻̠.̺̝̺̟͈ ̣̭T̪̩̼h̥̫̪͔̀e̫̯͜ ̨N̟e҉͔̤zp̮̭͈̟é͉͈ṛ̹̜̺̭͕d̺̪̜͇͓i̞á͕̹̣̻n͉͘ ̗͔̭͡h̲͖̣̺̺i͔̣̖̤͎̯v̠̯̘͖̭̱̯e̡̥͕-m͖̭̣̬̦͈i͖n̞̩͕̟̼̺͜d̘͉ ̯o̷͇̹͕̦f̰̱ ̝͓͉̱̪̪c͈̲̜̺h̘͚a̞͔̭̰̯̗̝o̙͍s͍͇̱͓.̵͕̰͙͈ͅ ̯̞͈̞̱̖Z̯̮̺̤̥̪̕a͏̺̗̼̬̗ḻg͢o̥̱̼.̺̜͇͡ͅ ̴͓͖̭̩͎̗ ̧̪͈̱̹̳͖͙H̵̰̤̰͕̖e̛ ͚͉̗̼̞w̶̩̥͉̮h̩̺̪̩͘ͅọ͎͉̟ ̜̩͔̦̘ͅW̪̫̩̣̲͔̳a͏͔̳͖i͖͜t͓̤̠͓͙s̘̰̩̥̙̝ͅ ̲̠̬̥Be̡̙̫̦h̰̩i̛̫͙͔̭̤̗̲n̳͞d̸ ͎̻͘T̛͇̝̲̹̠̗ͅh̫̦̝ͅe̩̫͟ ͓͖̼W͕̳͎͚̙̥ą̙l̘͚̺͔͞ͅl̳͍̙̤̤̮̳.̢ ̟̺̜̙͉Z̤̲̙̙͎̥̝A͎̣͔̙͘L̥̻̗̳̻̳̳͢G͉̖̯͓̞̩̦O̹̹̺!̙͈͎̞̬ *

into something like

T̫o̬ ì̬nv̖o͖k͍e̦ ̬t̕h̠e̱ ̥ḥi̼v҉e͈-m̷í͚n̝d̴ ̗r̞ep͇r̝e̴s̥e̵n̢t̺i͟n҉g̮ ̳c̖h̷a̧ó̭s̶.̨ ̳I̠n͇v̴o̸k҉i͜n̛g͙ ̠t̰h̫e̟ ̭fe̤e͖l͖i̢n̸g̙ ͍o͍f̖ ̘c̵h͝a̝o͙s̤.̙ ̢W̮i͖t̘h̷ ̭o̥ṳ̞t̨ ̟o̟r̞d̡e͓r̰.̺ ̣T̪h̥e̫ ̨N̟e҉zp̮é͉ṛ̹d̺i̞á͕n͉ ̗h̲i͔v̠e̡-m͖i͖n̞d̘ ̯o̷f̰ ̝c͈h̘a̞o̙s͍.̵ ̯Z̯a͏ḻg͢o̥.̺ ̴ ̧H̵e̛ ͚w̶h̩ọ͎ ̜W̪a͏i͖t͓s̘ ̲Be̡h̰i̛n̳d̸ ͎T̛h̫e̩ ͓W͕ą̙l̘l̳.̢ ̟Z̤A͎L̥G͉O̹!̙ *
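As a rough sketch, the three patterns above can be folded into one helper. The function name `limitCombiningMarks` and the `$max` parameter are hypothetical and not part of the original answer; the code assumes PHP 7 or later and a PCRE build with Unicode property support.

```php
<?php
// Hypothetical helper: keep at most $max consecutive combining marks
// (Unicode category M) and drop any further marks in the same run.
function limitCombiningMarks(string $text, int $max = 1): string
{
    if ($max < 1) {
        // Strip every combining mark.
        $result = preg_replace('/\pM+/u', '', $text);
    } else {
        // Capture the first $max marks of a run and discard the remainder.
        $result = preg_replace('/(\pM{' . $max . '})\pM+/u', '$1', $text);
    }

    // preg_replace() returns null on error, for example on invalid UTF-8.
    // Falling back to an empty string is an assumption of this sketch;
    // a caller may prefer to reject the message instead.
    return $result ?? '';
}
```

In the question's pipeline, a call such as `limitCombiningMarks($message, 1)` would most likely belong before `htmlentities()`, so that it operates on the raw characters rather than on escaped entities.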

Polynomial · Answer 2 · 2015-08-15T00:12:08.360

If you're looking for a quick fix, I would use a regex like this:

$cleanMessage = preg_replace("/[^\x20-\xAD\x7F]/", "", $input_lines);

Or, if you prefer:

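$cleanMessage = preg_filter("/[^\x20-\xAD\x7F]/", "", $input_lines);

Both apply the same replacement. The difference is that preg_filter() only returns subjects in which the pattern matched (null for a single string with no match), so preg_replace() is the safer choice when a message may already be clean.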

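These remove all characters outside of what this answer treats as "extended ASCII" (the range 0x20 to 0xAD). This means that "normal" text and the most commonly accented Roman characters will still work, but "zalgo" style text will not. Unfortunately, the side effect is that Arabic, Japanese, Chinese, Cyrillic, etc. will also be stripped as "bad". Note that without the `u` modifier the pattern matches individual bytes, so on UTF-8 input it may remove only part of a multi-byte character.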
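For UTF-8 input, a code-point-aware variant of the same idea might look like the sketch below. It is not the regex from this answer: it assumes "extended ASCII" means printable ASCII plus the Latin-1 Supplement block (U+00A0 to U+00FF), and the function name `keepLatin1Only` is hypothetical.

```php
<?php
// Hypothetical helper: keep printable ASCII (U+0020..U+007E) and the
// Latin-1 Supplement (U+00A0..U+00FF); remove every other code point.
function keepLatin1Only(string $text): string
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8,
    // so the ranges below refer to code points rather than bytes.
    $result = preg_replace('/[^\x{20}-\x{7E}\x{A0}-\x{FF}]/u', '', $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // returning an empty string is an assumption of this sketch.
    return $result ?? '';
}
```

The trade-off described above still applies: combining marks are removed, but so is any text in a non-Latin script, along with Latin letters outside Latin-1.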

There's no trivial way to just prevent the kind of abuse you're seeing, because there are so many Unicode tricks you can use to apply diacritic marks to letters. It'd be a full-time job to attempt to filter them out in a way that didn't affect some language somewhere.

My non-technical advice would be to allow users to report people who post these kinds of messages, so that they can be banned by an administrator.

There's no clear definition of "extended ASCII". Your regex will remove accented characters like `U+010C LATIN CAPITAL LETTER C WITH CARON` which is unacceptable in many languages even if they're based on the Latin alphabet. With a regex engine with decent Unicode support, it's trivial to remove excess Unicode combining marks. — nwellnhof, Oct 06 '15 at 12:48
