
I've failed to find definitive best practices for handling incoming data. Some other threads had useful information, but I still have a lot of unanswered questions. All I know for sure is that UTF-8 is the one and only modern standard. My question involves PHP, but some general practices may apply to other languages as well. I'm willing to respect the accepted standards, assuming the performance costs are negligible. Feel free to point toward benchmarks to justify particular choices.

1) Should you really check all incoming data (APIs, GET, POST, ...) that is subject to manipulation or storage? In the particular case of WebSockets and REST APIs, that doesn't strike me as sane performance-wise: constant encoding checks on every incoming string and variable. Is that really what good practice calls for? If yes, is there a method that is not too costly on server resources? I've seen this used to determine whether a variable is UTF-8:

if (preg_match('!!u', $data)) // the 'u' modifier makes preg_match() fail on invalid UTF-8
{
   echo 'this is utf-8'; //use the var
}
else 
{
   echo 'definitely not utf-8'; //do something else
}

Doing this all the time feels like overkill. And shouldn't that function be mb_ereg_match?
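If it helps frame the question, here is the kind of alternative I was considering (a sketch of my own, assuming mb_check_encoding() is available); it validates a string against a given encoding instead of guessing at it:

// Validate (not detect) that a value is well-formed UTF-8.
function isUtf8(string $data): bool
{
   return mb_check_encoding($data, 'UTF-8');
}

if (isUtf8($_POST['comment'] ?? '')) // 'comment' is just a placeholder field name
{
   // safe to treat as UTF-8
}
else
{
   // reject it, or convert it from a known source encoding
}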

2) Assuming you should always check incoming data, what is a viable function to use to convert the data to UTF-8?

3) What about dates, integers and decimals taken from a database or from GET/POST: do they have anything to do with UTF-8? Do you have to encode them to UTF-8 before sending them to MySQL? As for line breaks, do they "appear" in UTF-8 as visible line breaks, or do they always show as \r\n in a UTF-8 text? If the latter, is there a reason why phpMyAdmin replaces \r\n with visible line breaks in its interface?
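To illustrate the line-break part (a throwaway snippet of my own):

// "\r\n" is just two ASCII control characters, which are also valid UTF-8 bytes.
$text = "first line\r\nsecond line";

echo $text;        // visible break in plain-text contexts (CLI, <pre>, phpMyAdmin's grid)
echo nl2br($text); // in HTML output you need nl2br() (or similar) to get a visible break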

4) Same question for arrays (especially those to be encoded into JSON); see the small example after this list:

  • should the array keys be encoded to UTF-8?
  • should the data stored under those keys (the values) be encoded to UTF-8?
  • should the array variable itself be encoded to UTF-8?
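
For instance (a contrived example of my own, just to show the concern):

// json_encode() requires every key and every value to be valid UTF-8;
// on an invalid byte sequence it returns false (or throws with JSON_THROW_ON_ERROR).
$payload = ['título' => 'valeur accentuée'];

echo json_encode($payload, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);
// {"título":"valeur accentuée"}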

5) Should we learn to use the multibyte versions of string functions instead of the usual single-byte string functions, as shown in http://php.net/manual/en/ref.mbstring.php? That would mean going through all existing code and replacing the functions, for the sake of easy reusability...
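
The difference shows up as soon as a string contains non-ASCII characters; a quick throwaway example of what I mean:

$word = 'héllo'; // the 'é' takes two bytes in UTF-8

echo strlen($word);                 // 6 -- counts bytes
echo mb_strlen($word, 'UTF-8');     // 5 -- counts characters
echo strtoupper($word);             // 'HéLLO' -- the single-byte function skips (or mangles) the 'é'
echo mb_strtoupper($word, 'UTF-8'); // 'HÉLLO'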

6) When using utf8mb4_unicode_ci (or a variation of it) on MySQL columns, what is the maximum VARCHAR() size possible? Apparently 255 is not an option. I'm also wary about performance when the field is part of an index.

7) Still with acceptable performance in mind, can you please confirm (or correct) that the following is a proper way to handle encoding in a PHP/MySQL environment, or point out any missing element? (Keeping the software up to date is not listed, as it is common sense.) A short sketch of how I picture the PHP side follows the list.

  • MySQL: use utf8mb4_unicode_520_ci as the default collation, and on every column that can contain anything other than numbers, dates or times.
  • Web Page: use of <meta charset="UTF-8"> by default.
  • PHP Server: the mbstring extension enabled, with its Multibyte Support parameter on, and default_charset=UTF-8 in php.ini.
  • PHP Script: call mb_internal_encoding('UTF-8'); followed by mb_http_output('UTF-8'); on every .php page, at the very beginning right after the opening <?php tag. (Can't this be set up as a default in PHP?)
  • PDO: use of the parameter charset=utf8mb4 when creating a new PDO object.
  • Text Editor: if using Notepad++, use the "Encode in UTF-8" setting from the very beginning, for every file regardless of the extension.
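
Concretely, I picture something along these lines at the top of every entry point (my own guess at the wiring; host, database and credentials are made up):

// bootstrap.php (hypothetical) -- run before any output is sent
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
header('Content-Type: text/html; charset=UTF-8');

// charset=utf8mb4 in the DSN so the connection itself uses utf8mb4
$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb;charset=utf8mb4',
    'user',
    'password',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);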

Hoping for this thread to become the last and most comprehensive place to learn about best encoding practices, with acceptable performance, in a PHP/SQL environment.

user9203881

1 Answer


Everything I'm about to say is secondary to: UTF-8 all the way through

  1. You should always know the encoding of your input beforehand, either because you follow the above, or because you have agreed on standards with your external data providers. Guessing at encodings is a bad idea, and so is attempting to detect the encoding. That includes using a function like mb_detect_encoding(): there is no reliable way to detect an encoding, and at the end of the day it is an educated guess at best.

  2. mb_convert_encoding() with both the input and output encodings specified, because of #1.
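
     For example, if an external feed is documented as ISO-8859-1 (a made-up scenario), a minimal conversion sketch could look like this:

     // Convert from a *known* source encoding to UTF-8; never let PHP guess it.
     $utf8 = mb_convert_encoding($rawFeedData, 'UTF-8', 'ISO-8859-1'); // $rawFeedData is hypothetical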

  3. If your input is a string you should handle it as such according to the above. If it's a number it's more or less universal. There are edge cases to this, but it's unlikely that anyone will encounter them without being in deeper trouble.

  4. Arrays are a complex type and cannot be transmitted between systems without some form of intermediate encoding, and the rules of that encoding will define how to handle string data and the string encoding of the data itself. Eg: Read the JSON spec.

  5. Yes. If you're using a multibyte encoding you should be using the multibyte functions where applicable.

  6. IIRC this depends on the page size and the overall size of the data in your column as it all needs to fit inside a single page. You can fudge this with the TEXT types because they're technically stored off-page, but they have their own tradeoffs. This is a whole question unto itself that's probably answered elsewhere.
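
     As a rough illustration only (the exact numbers depend on your MySQL version, row format and index settings, so verify them): with the older 767-byte InnoDB index prefix limit and utf8mb4's 4 bytes per character, a fully indexed column tops out around VARCHAR(191), which is likely where the "255 is not an option" impression comes from. A sketch, assuming a PDO connection like the one in the question:

     // Hypothetical table: VARCHAR(191) keeps the whole column indexable
     // under the old 767-byte index prefix limit (191 * 4 = 764 bytes).
     $pdo->exec("
         CREATE TABLE users (
             id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
             email VARCHAR(191) NOT NULL,
             UNIQUE KEY idx_email (email)
         ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci
     ");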

  7. UTF-8 all the way through

Sammitch