I've failed to find definitive best practices for handling the encoding of incoming data. Some other threads had useful information, but I still have a lot of unanswered questions. All I know for sure is that UTF-8 is the one and only modern standard. My question involves PHP, but maybe some general practices apply to other languages as well. I'm willing to respect the accepted standards, assuming the performance costs are negligible. Feel free to point toward benchmarks to justify particular choices.
1) Should you really check every piece of incoming data (APIs, GET, POST, ...) that will be manipulated or stored? In the particular case of WebSockets and REST APIs, I can't see that as sane performance-wise: is constantly checking the encoding of every incoming string and variable really what good practice requires? If yes, is there a method that is not too costly in server resources? I've seen this being used to determine whether a variable is UTF-8:
if (preg_match('!!u', $data))
{
    echo 'this is utf-8'; // use the var
}
else
{
    echo 'definitely not utf-8'; // do something else
}
Doing this all the time feels like overkill. And shouldn't that function be mb_ereg_match?
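To make the comparison concrete, here is a minimal sketch of two ways such a check could be wrapped: the PCRE "u" modifier from the snippet above, and mb_check_encoding from the mbstring extension. The helper names is_utf8_pcre and is_utf8_mb are hypothetical, and the sample strings are made up for illustration. This does not settle which one is cheaper; that would need a benchmark.

```php
<?php
// Hypothetical helper: validation through the PCRE 'u' modifier.
// preg_match() returns 1 when the subject is valid UTF-8 and the (empty)
// pattern matches, and false when the subject is not valid UTF-8.
function is_utf8_pcre(string $data): bool
{
    return preg_match('//u', $data) === 1;
}

// Hypothetical helper: validation through mbstring (extension must be loaded).
function is_utf8_mb(string $data): bool
{
    return mb_check_encoding($data, 'UTF-8');
}

$valid   = "caf\u{00E9}"; // "café" encoded as UTF-8 (bytes C3 A9 for the é)
$invalid = "caf\xE9";     // "café" encoded as ISO-8859-1: a lone E9 byte

var_dump(is_utf8_pcre($valid), is_utf8_mb($valid));     // bool(true) bool(true)
var_dump(is_utf8_pcre($invalid), is_utf8_mb($invalid)); // bool(false) bool(false)
```

Both helpers only validate; neither converts anything, which is what question 2 is about.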
2) Assuming you should always check incoming data, what is a viable function to use in order to convert the data to UTF-8?
3) How about dates, integers and decimals taken from a database, or from GET/POST: do they have anything to do with UTF-8, and do you have to encode them as UTF-8 before sending them to MySQL?
As for line breaks, do they "appear" in UTF-8 as visible line breaks, or do they always show as \r\n in a UTF-8 text? In that case, is there a reason why phpMyAdmin replaces \r\n with visible line breaks in the interface?
4) Same question for arrays (especially those to be encoded as JSON):
- should the array keys be encoded as UTF-8?
- should the values stored under the keys be encoded as UTF-8?
- should the whole array variable itself be encoded as UTF-8?
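For context on the JSON case, here is a minimal sketch of how json_encode reacts to string data that is not valid UTF-8, on PHP 7 or later and without the JSON_THROW_ON_ERROR flag. The array contents are made up for illustration.

```php
<?php
// Illustrative data only.
$ok  = ['name' => "Zo\u{00EB}", 'count' => 3]; // UTF-8 string value, integer value
$bad = ['name' => "Zo\xEB"];                    // ISO-8859-1 byte, invalid as UTF-8

$jsonOk = json_encode($ok);
var_dump($jsonOk);  // string: {"name":"Zo\u00eb","count":3}

$jsonBad = json_encode($bad);
var_dump($jsonBad);                              // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)

// Invalid bytes in a key make the call fail the same way.
$jsonBadKey = json_encode(["Zo\xEB" => 1]);
var_dump($jsonBadKey);                           // bool(false)
```

The integer value passes through untouched, so in this sketch only the strings (keys and values) are subject to the UTF-8 requirement.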
5) Should we learn to use the multibyte versions of string functions instead of the usual non-multibyte ones, as listed in http://php.net/manual/en/ref.mbstring.php ? That would mean going through all existing code and replacing the functions for the sake of easy reusability...
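To illustrate what is at stake in this question, here is a small sketch of how the byte-oriented functions and their mb_ counterparts differ on a UTF-8 string. The sample word is arbitrary.

```php
<?php
$s = "na\u{00EF}ve"; // "naïve": 5 characters, 6 bytes in UTF-8 (ï is C3 AF)

$byteLength = strlen($s);               // counts bytes
$charLength = mb_strlen($s, 'UTF-8');   // counts characters

$byteCut = substr($s, 0, 3);            // cuts in the middle of "ï"
$charCut = mb_substr($s, 0, 3, 'UTF-8');

var_dump($byteLength);          // int(6)
var_dump($charLength);          // int(5)
var_dump(bin2hex($byteCut));    // string(6) "6e61c3" (a dangling C3 byte)
var_dump($charCut === "na\u{00EF}"); // bool(true)
```

For pure ASCII input the two families return the same results, so the difference only shows up once multibyte characters are present.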
6) When using utf8mb4_unicode (or a variation of this) on MySQL columns, what is the maximum VARCHAR() size possible? Apparently 255 is not an option. I'm also wary about performance when the field is part of an index.
7) Still with good-enough performance in mind when applying best practices, can you please confirm (or correct) that the following is a proper way to handle encoding in a PHP/MySQL environment, or point out any missing element? Keeping the software up to date is not listed, as it is common sense.
- MySQL: use of utf8mb4_unicode_520_ci as the default collation, and on every column that can contain anything other than numbers, dates or times.
- Web page: use of <meta charset="UTF-8"> by default.
- PHP server: use of the mbstring extension with its Multibyte Support parameter enabled, and default_charset=UTF-8 in php.ini.
- PHP script: use of mb_internal_encoding('UTF-8'); followed by mb_http_output('UTF-8'); on every .php page, at the very beginning after the opening <?php tag. (Can't this be set up as the default in PHP?)
- PDO: use of the parameter charset=utf8mb4 when creating a new PDO object.
- Text editor: if using Notepad++, using the "Encode in UTF-8" setting from the very beginning, for every page regardless of the extension.
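The PHP script and PDO items above can be sketched as follows. The helper name utf8mb4_dsn, the host, the database name and the credentials are all placeholders, not part of any real setup, and the actual connection is left commented out so the sketch runs without a database.

```php
<?php
// Hypothetical helper: builds a MySQL DSN that requests utf8mb4 for the connection.
function utf8mb4_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Script-level mbstring settings from the list above.
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');

$dsn = utf8mb4_dsn('localhost', 'example_db'); // placeholder host and database
echo $dsn, PHP_EOL; // mysql:host=localhost;dbname=example_db;charset=utf8mb4

// Placeholder credentials; uncomment with real values to connect:
// $pdo = new PDO($dsn, 'user', 'password', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
```

Whether the two mb_ calls are still needed once default_charset is set in php.ini is part of what I'm asking.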
Hoping for this thread to become the last and most comprehensive place to learn about the best encoding practices, with acceptable performance, in a PHP/SQL environment.