0

I noticed that when doing database queries in PHP (e.g. Zend_db, mysqli...), you can set the character set. For example: mysqli_set_charset($con,"utf8"); I'm a little foggy as to what this actually does behind the scenes.

If I use php to do a database SELECT query, and I indicate a character set, what happens if it's not the same character set that the column was defined as in the database?

I mean, the database returns a binary sequence, but what is actually returned if the character is not encoded the same in the two character sets? Will mySQL take the internal binary data and return it "As-is"?

Or will MySQL try to convert it to the binary sequence that's the equivalent in the character set you indicated?

I guess the gist of my question is that I would like to know how the data is encoded when PHP is sending in the query, how it's transmitted back from MySQL, and whether there's another step of translation after PHP gets it back and stores it into a string in PHP internal memory.

Similarly, if you're doing an INSERT or update, how does it get sent from PHP to MySQL? Does PHP convert it to the correct binary encoding THEN send it into MySQL?

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Update:

Thanks to Raymond Nijland. I was able to fix my bug. But I did notice that for nonstandard characters, the charset does seem to matter.

I did a select statement using $db = new \PDO("mysql:host=$host;dbname=$database;charset=latin1", $dbuser, $dbpassword);. First, I tried latin1, then I tried utf8.

My problem was that I had a column with encrypted data, which I guess had some wierd characters in it. if I did an md5 on that field directly in the database query, it gave me an encoding that began with 889... BUT, I tried to pulled it into PHP with a SELECT statement. If I used PDO with a charset of latin1, then did an MD5() inside of PHP, it gives me the same hash (889...). Which implies that PHP has an exact copy of the binary that's in the database. BUT if I did used PDO with charset 'UTF-8', then did an MD5() in PHP, it gave me a hash beginning with 087... So somewhere, a conversion must be taking place.

At this point, my orignal bug is fixed, but I'm still curious as to what's happening. Is MYSQL doing the conversion before returning it to PHP, or does PDO do some sort of conversion on the PHP side?

NL3294
  • 984
  • 1
  • 10
  • 27
  • 2
    Short answer: `mysqli_set_charset($con,"utf8");` needs always to be the same as the MySQL server charset connection setting.. Otherwise charset encoding injections can happen which might cause SQL injections with or without prepared statements.. – Raymond Nijland Sep 23 '19 at 16:58
  • 1
    *"I mean, the database returns a binary sequence, but what is actually returned if the character is not encoded the same in the two character sets? Will mySQL take the internal binary data and return it "As-is"? "* in general you should be fine when you use utf8 with other standardized charsets like latin (ISO 8859-1, ISO 8859-15) or ascii (US-ASCII/ANSI_X3.4-1968), don't mix it up with a "exotic" non standardized charset like Windows-1252 for example – Raymond Nijland Sep 23 '19 at 17:02
  • So if you use the wrong charset in mysqli_set_charset, it's supposed to stop you by throwing an error? I'm still unclear. Are you stating that utf-8 perflect encompasses latin1 exactly, and therefore, I don't need to worry about it? I always thought there were small differences, but I can't find any examples at the moment. – NL3294 Sep 23 '19 at 17:21
  • *"So if you use the wrong charset in mysqli_set_charset, it's supposed to stop you by throwing an error?"* No it wont throw a error i wish it did, it would made PHP a bit more safer *" I don't need to worry about it? I always thought there were small differences,"* Indeed you don't have to worry ISO 8859-1 (latin1) fits into utf8 as ascii fits into ISO 8859-1 (latin1) ... – Raymond Nijland Sep 23 '19 at 17:24
  • Thanks. I was able to fix my bug. But I'm still trying to get a deeper understanding of what's going on behind the scenes. I'm going to update my question. – NL3294 Sep 23 '19 at 18:06

1 Answers1

1

mysqli_set_charset($con,"utf8"); (or other code for other client libraries) declares to MySQL that the encoding in the client will be MySQL's CHARACTER SET utf8. If bytes with a different encoding are sent to (think INSERT) mysql, garbage or errors will occur.

That setting also declares that the client desires that encoding from SELECTs.

The CHARACTER SET on each column in each table may be something else (eg, "latin1"). If so, MySQL will attempt to convert the encoding during the transmission.

Caution: MySQL's CHARACTER SET utf8 is a subset of the well-known UTF-8. To get the latter, use CHARACTER SET utf8mb4 in tables and mysqli_set_charset($con,"utf8mb4"); when connecting.

Going forward, utf8mb4 is preferred in most situations.

Non-text stuff ("as-is") should be put into BLOB or VARBINARY columns -- this bypasses any checking of the encoding. (Think a .jpg or AES_ENCRYPT.)

MySQL's MD5() function returns a hex string. UNHEX(MD5('...')) return binary stuff and must be store in, say, a BINARY(16) column.

Many forms of garbled text are discussed in Trouble with UTF-8 characters; what I see is not what I stored .

Rick James
  • 135,179
  • 13
  • 127
  • 222