(I'll never understand why something like this isn't a simple, nice function built into PHP, rather than something that has to be individually researched, often incorrectly, and cobbled together by every single programmer, but here goes...)

I do the following to "clean" strings (Unicode) coming from users/external sources:

    $string = preg_replace('#[[:cntrl:]]#', '', $string); // Removes all "control characters".
    $string = preg_replace('#\p{C}+#u', '', $string); // Removes all "invisible" characters. (As if the control ones above aren't invisible?)
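
For reference, here is a minimal sketch of the same idea folded into one helper. The function name `strip_other_category` and its `$keepWhitespaceControls` flag are hypothetical, made up for illustration. It assumes the input is meant to be UTF-8 and relies on the fact that `\p{C}` (the Unicode "Other" category) already covers the ASCII control characters that `[[:cntrl:]]` matches, so a single replacement may be enough.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not a PHP built-in): removes all code points in the
 * Unicode "Other" category (\p{C}) from a UTF-8 string.
 * Returns null when the input is not valid UTF-8.
 */
function strip_other_category(string $input, bool $keepWhitespaceControls = false): ?string
{
    // With the /u modifier PCRE validates the subject as UTF-8;
    // preg_match() returns false instead of 1 when validation fails.
    if (preg_match('//u', $input) !== 1) {
        return null;
    }

    // \p{C} = Cc (control) + Cf (format) + Cs + Co + Cn, so the ASCII
    // control characters matched by [[:cntrl:]] are already included.
    $pattern = $keepWhitespaceControls
        ? '/[^\P{C}\t\n\r]+/u' // \p{C} minus tab, line feed, carriage return
        : '/\p{C}+/u';

    return preg_replace($pattern, '', $input);
}

// U+200B (zero width space) and U+202E (right-to-left override) are format
// characters; \x07 (bell) and \n are control characters.
$dirty = "ab\u{200B}c\u{202E}d\x07e\n";

var_dump(strip_other_category($dirty));       // string(5) "abcde"
var_dump(strip_other_category($dirty, true)); // string(6) "abcde" followed by the newline
var_dump(strip_other_category("\xFF"));       // NULL (invalid UTF-8)
```

Two caveats with this sketch: without the flag it also removes tabs and line breaks, and `\p{C}` includes the zero width joiner (U+200D), so stripping it will likely break some emoji sequences and text in scripts that rely on joiners. Whether that trade-off is acceptable depends on what the string is used for.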

Is this enough? Does this take care of all the abusable/weird/special Unicode characters? The whole Unicode thing seems to be a dream for people wanting to be malicious. There's so much weird stuff in that huge set of characters that it seems impossible for any single person to get a grasp of it.

Am I missing something? Maybe there is such a built-in function which does what I do, only better and more complete? If not, why is that? It sometimes feels like I'm the only one concerned with security/control whatsoever...

  • I'm not sure I understand the focus on Unicode. Unicode is just an agreed-upon standard for which lookup keys (codepoints) map to which characters; UTF-8 is one standard way of encoding those codepoints as bytes. There is not much inherently insecure about Unicode, and in fact, the control characters your first replacement targets are part of ASCII, which is also *included* in Unicode. Neither of your lines of code would protect you against XSS if you are prepping user-submitted comments for echoing in HTML, or against SQL injections. – Joshua T Jun 18 '19 at 10:04
  • My post doesn't mention a single word about HTML or SQL. It does, however, mention Unicode a number of times. I have no idea how you could possibly get the idea that I'm talking about HTML/SQL? –  Jun 18 '19 at 10:53
  • Because you asked about Unicode in reference to PHP, where some of the largest attack surfaces are SQL (injections) and HTML output (XSS). You did not specify what is being done with the user input, so I was trying to cover the usual bases. If we are strictly discussing just Unicode, then even the most common exploit, ["visual spoofing"](http://unicode.org/reports/tr36/#visual_spoofing), has little to do with control or invisible characters. – Joshua T Jun 18 '19 at 11:13
  • Anyway, a regex-based answer you might be looking for is [this](https://stackoverflow.com/q/1176904), or for a built-in PHP solution, [sanitize filters](https://php.net/manual/en/filter.filters.sanitize.php). – Joshua T Jun 18 '19 at 11:13
  • I don't understand your links, except for the visual spoofing thing, which I consider related but not strictly a technical security issue, unlike things like forcing the whole webpage into backwards text mode, or bypassing a filter for a word by adding invisible chars. That's what I'm trying to secure. Unicode text itself. I want all malicious/stupid chars to be removed. (I do already use "spoofchecker" to check for confusable characters.) I still don't understand what is so unclear about what I asked, though... –  Jun 18 '19 at 11:22
  • As currently written this question is not answerable. Whom do you want to protect against what? For example, in a chat it's completely acceptable to use RTL overrides, invisible control characters, emoji, whatever. If that chat is programmed properly, all this weird stuff should not affect anything more than the particular text. So please explain your scenario and the problems you ran into. Otherwise this question could be closed for being too broad. – thehennyy Jun 18 '19 at 11:42
  • `\p{Is_Malicious}` and `\p{Is_Stupid}` match characters with the malicious/stupid property. – daxim Jun 18 '19 at 13:29
  • @thehennyy I don't want to repeat myself yet again. Please read before replying. –  Jun 18 '19 at 16:32
  • @daxim I assume that's a joke? I'm surprised that this site allows jokes since anything I post seriously is not appreciated/understood... –  Jun 18 '19 at 16:32

0 Answers