The reasons for checking for invalid UTF-8, converting single less-than signs, and stripping octets for security

Question

I'm researching how to sanitize user input from a textarea field in WordPress.

I found several sanitizing functions, but there are some differences between them.

I'm curious about the features of one of these functions, sanitize_text_field( string $str ).

First, why does it "check for invalid UTF-8"? Why does invalid UTF-8 need to be sanitized?

Second, I would like to know the reason it converts single < characters to entities.

Third, what is the reason it "strips octets"?

Thank you for your help in advance!

Does this answer your question? [What are the best PHP input sanitizing functions?](https://stackoverflow.com/questions/3126072/what-are-the-best-php-input-sanitizing-functions) — Your Common Sense, May 02 '20 at 10:19
@YourCommonSense Thank you for your link :) I appreciate it. I just wanted to know regarding sanitizing input field technique used in Wordpress :) — hiyo, May 02 '20 at 11:43

score 1 · Accepted Answer · answered May 02 '20 at 21:45

I'm not a fan of the term "input sanitization". It is a misleading term that suggests you can wave a magic wand at all data and make it "safe data". The problem is that the definition of "safe" changes when the data is interpreted by different pieces of software, as do the encoding requirements. Similarly, the concept of "valid" data varies depending on context: your data may very well require special characters (', ", &, <). Note that SO allows all of these as data.

Output that may be safe to be embedded in an SQL query may not be safe for embedding in HTML. Or Swift. Or JSON. Or shell commands. Or CSV. And stripping (or outright rejecting) values so that they are safe for embedding in all those contexts (and many others) is too restrictive.

So what should we do? Make sure the data is never in a position to do harm. The best way to achieve this is to avoid interpretation of the data in the first place. Parameterized SQL queries are an excellent example of this; the parameters are never interpreted as SQL, they're simply processed by the database as data.
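As a minimal sketch of that idea (written in Python with the standard-library sqlite3 module purely for illustration; the table name and the input string are made up, and a WordPress site would use PHP's own prepared-statement facilities instead), the value is bound as a parameter and never becomes part of the SQL text:

```python
import sqlite3

# In-memory database with a hypothetical "comments" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")

# Input that would be dangerous if it were concatenated into the SQL string.
user_input = "Robert'); DROP TABLE comments;--"

# The "?" placeholder binds the value as data; it is never parsed as SQL.
conn.execute("INSERT INTO comments (body) VALUES (?)", (user_input,))

# The value comes back exactly as it was submitted.
stored = conn.execute("SELECT body FROM comments").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
```

Nothing was stripped or altered on the way in, and the table is still there afterwards.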

That same data may be used for other formats, such as HTML. In that case, the data should be encoded / escaped for that particular language at the moment it's embedded. So, to prevent XSS, data should be HTML-escaped (or JavaScript-escaped or URL-escaped) at the time it's being put into the output. Not at input time. The same applies to other embedding situations.
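To illustrate escaping at output time, here is a rough Python sketch using only standard-library helpers. The render_comment function and the /search URL are hypothetical names chosen for this example; the point is that the same raw string is stored once and encoded differently for each target context:

```python
import html
import json
from urllib.parse import quote

# The raw value is kept unmodified in storage.
raw = '<script>alert("x")</script> & friends'

def render_comment(body: str) -> str:
    """Hypothetical template helper: HTML-escape at the moment of embedding."""
    return "<p>" + html.escape(body, quote=True) + "</p>"

# Each output context gets its own encoding of the same raw data.
html_out = render_comment(raw)                  # for an HTML page
json_out = json.dumps({"body": raw})            # for a JSON response
url_out = "/search?q=" + quote(raw, safe="")    # for a URL query string
```

Escaping for HTML at input time would have corrupted the JSON and URL outputs, which is why the encoding step belongs where the data is embedded.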

So, should we just pass anything we get straight through?

No - there are definitely things you can check about user input, but this is highly context-dependent. Let's call this what it is - validation. Make sure this is done on the server. Some examples:

You should usually verify that any string contains only valid characters for its encoding (e.g., no invalid UTF-8 sequences).
If a field is supposed to be an integer, you can certainly validate this field to ensure it contains an integer (or maybe NULL).
You can often check that a particular value is one of a set of known values (whitelist validation).
You can require most fields to have a minimum and maximum length.
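The checks above could be sketched roughly as follows. This is Python for illustration only; the function names, the ALLOWED_COLORS set, and the length limits are all assumptions made up for the example, not part of any WordPress API:

```python
import re
from typing import Optional

# Hypothetical set of known-good values for a whitelist check.
ALLOWED_COLORS = {"red", "green", "blue"}

def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

def parse_optional_int(value: Optional[str]) -> Optional[int]:
    """Return an int, or None for a missing value; reject anything else."""
    if value is None or value == "":
        return None
    if re.fullmatch(r"-?[0-9]+", value) is None:
        raise ValueError("field must be an integer")
    return int(value)

def is_allowed_color(value: str) -> bool:
    """Whitelist validation: accept only values from the known set."""
    return value in ALLOWED_COLORS

def has_valid_length(value: str, minimum: int = 1, maximum: int = 200) -> bool:
    """Length check, counted in characters (the limits here are arbitrary)."""
    return minimum <= len(value) <= maximum
```

Note that each check rejects bad input rather than trying to repair it, which keeps validation separate from output escaping.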

Why is ensuring valid UTF-8 important? Because invalid UTF-8 sequences are a great way to bypass validation (especially blacklist validation) or spoof visible input as something else. They are quite often interpreted differently by different layers of the technology stack. See "Are there any security bugs with UTF-8?" for more detail on this kind of attack.
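A small Python sketch of one such attack, the overlong encoding, may make this concrete. The lenient decoder below is a deliberately sloppy, hypothetical stand-in for a layer that does not reject overlong forms; the byte pair 0xC0 0xBC is an invalid two-byte encoding that such a layer would turn into "<":

```python
# 0xC0 0xBC is an overlong (invalid) UTF-8 encoding of "<" (U+003C).
payload = b"\xc0\xbcscript>"

def naive_blacklist_ok(raw: bytes) -> bool:
    """Blacklist check on raw bytes: passes if no literal '<' byte is present."""
    return b"<" not in raw

def lenient_decode_two_byte(raw: bytes) -> str:
    """Hypothetical sloppy decoder: accepts 2-byte sequences, overlong or not."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if 0xC0 <= b <= 0xDF and i + 1 < len(raw):
            # Combine 5 bits from the lead byte with 6 bits from the next byte.
            out.append(chr(((b & 0x1F) << 6) | (raw[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)

def strict_decode_ok(raw: bytes) -> bool:
    """A strict UTF-8 decoder rejects the overlong sequence outright."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

passes_blacklist = naive_blacklist_ok(payload)        # the filter sees no "<"
seen_by_lenient_layer = lenient_decode_two_byte(payload)
accepted_by_strict = strict_decode_ok(payload)
```

The blacklist lets the payload through because it contains no literal "<" byte, yet the lenient layer reconstructs "<script>" from it. Rejecting invalid UTF-8 up front closes that gap.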
