1

Latin1 Connection, instead of UTF8

I've had reports recently from those using my website, that they are unable to create content in Chinese or Arabic. This led me to try creating content on my site using Chinese characters and I have observed that the data being stored for these characters is a question mark ?.

I realised from other questions and articles that I've read that I'm probably in 'character set hell'.

It seems that I've been connecting to the database using Entity Framework using a Latin1 connection, as this is the default for MySQL, but the columns in my database are UTF8.

How is my data encoded?

From what I have read of the process, I think that the data in my database is UTF8 data from the browser, encoded as Latin1 during transit to the database which is then encoded again as UTF8 just before it is stored in the database.

I have been using a PHP script to determine whether the data comes out correctly when I set my connection to use UTF8 - and when I select data from my database, it doesn't look like there is any difference.

  <!DOCTYPE html>
  <html>
  <head>
   <meta charset="utf-8" />
  </head>
  <body>
    <?php 
    
    //
    // Make the connection to the database
    //
    $link = mysqli_connect('localhost','root', '', 'mydatabase');

    if (!$link) { 
        die('Could not connect to MySQL: ' . mysql_error()); 
    }

    // Set connection character set to UTF8
    $link->set_charset('utf8');
    
    echo '<p>Connection OK</p>';
    
    //
    // Request the string from the database
    //
    $result = $link->query("SELECT questiontext FROM question WHERE id = 101");
    
    $row = $result->fetch_assoc();
    
    // Display the data
    echo "Result: " . $row['questiontext'] . '<br/>';
    
    mysqli_close($link);
    
    ?>
  </body>
  </html>
  • If I request a string of what looks like [A-Z 0-9] characters using a UTF8 connection to existing data, using this PHP script, it displays on the screen just as it did over the Latin1 connection when using Entity Framework, I can't tell the difference. There doesn't appear to be any issue with the data.
  • If I request a string of Chinese characters that appear to have been entered into the database as question marks, it displays as question marks when retrieved over a UTF8 connection.

I expected that when I connected to the database using a UTF8 connection, that the data would be displayed as garbage, as I was previously using a Latin1 connection - but it isn't.

I used Entity Framework to interrogate the MySql variables that were being used for the connection before and after adding CharSet=utf8; to my connection string. Hopefully can give you an idea about how the connection was being established before and how it is now:

Connection before:

Connection before

Connection with connection string charset updated:

Connection after connection string update

How can I determine if the data in the database is encoded incorrectly, whether it is Latin1 data encoded as UTF8 so that I can decide if I can just change my connection string to use UTF8 and everything will work OK?

Update

I've been experimenting by switching the connection type between UTF8 and Latin1 and these are my findings...

If I set my connection type to latin1 and output the characters, I end up getting something like this:

Tu es dans une �le d�serte

HEX (bin2hex): 54752065732064616e7320756e6520 ee 6c652064 e9 7365727465203a

If I set my connection to utf8:

Tu es dans une île déserte

HEX (bin2hex): 54752065732064616e7320756e6520 c3ae 6c652064 c3a9 7365727465203a

(bold and spacing added by Rick James)

When using a UTF8 connection, there aren't any dodgy looking characters at all - only when I set my connection type to latin1. This leads me to believe that the encoding of my data is OK, presumably it's just straight forward UTF8.

I can only decipher from this, that Entity Framework has been communicating over a UTF8 connection all along, but I don't know how I can confirm that the data is stored correctly.

Community
  • 1
  • 1
Luke
  • 22,826
  • 31
  • 110
  • 193
  • 1
    Reference: [How to support UTF-8 completely in a web application](http://stackoverflow.com/questions/279170/how-to-support-utf-8-completely-in-a-web-application) – FirstOne Jul 13 '16 at 13:35
  • 1
    Thanks for linking that. I think I know what the solution is going forward, as I have changed my connection to use UTF8, my web pages are set to output in UTF8 and the tables I am storing to are UTF8 - but the thing I'm trying to get to the bottom of is 'Is my existing stored data incorrectly encoded?' and 'Does the existing data require dumping out and reloading?' – Luke Jul 13 '16 at 13:37
  • If you're really storing "???" characters in your database, then your data is *gone*. If it's something even slightly different, provide a sample. – deceze Jul 13 '16 at 13:39
  • 1
    If you read the link FirstOne provided, you will note that you do *not* want to be storing anything in MySQL in `utf8_` but instead be using `utf8mb4_` . MySQL UTF-8 is actually more like UTF3 – Martin Jul 13 '16 at 13:43
  • Thanks @Martin I will look into that now. – Luke Jul 13 '16 at 13:47
  • @deceze Yes, it seems that if I enter Chinese characters and submit them to be stored, they store as question marks. If I select `HEX(questiontext)` on this data, it comes back as `F3F3F3F3` which I'm sure I read somewhere that this data is now gone. What about other data, just 'normal' characters? Is there a way to check that the readable data is encoded correctly? Because the other data looks fine when I connect as UTF8 and output it onto a web page. – Luke Jul 13 '16 at 13:51
  • I just attempted to store the data '新浪新闻' (Random chinese characters that I took from the Internet) and stored it using my PHP script with the encoding set to `latin1` in an attempt to reproduce what happens in Entity Framework, but it stored in the database as `新浪新闻` which is obviously incorrect. But this isn't what happens with my site when using Entity Framework, it just stores the data as `???????`. – Luke Jul 13 '16 at 14:05
  • Should I assume that if the data looks right when output onto the page when using a UTF8 connection that the existing data is encoded correctly? – Luke Jul 13 '16 at 14:11
  • There may be a setting within Entity for how to handle multibyte characters, within the framework. I don't know. You would also probably need to use PHP `mb_string` functions if you're doing any processing work on any string *contents*. While UTF-8 is a pretty good universal character set I have read some articles which state there are *some* Chinese and Japanse characters that do not have a UTF-8 translation, but I think these are edge cases. – Martin Jul 13 '16 at 14:14
  • Yeah, I have a feeling that Entity Framework is doing something special with the data, to be storing is at question marks. Is there a way to test if data is Latin1 stored as UTF8, or is the data correct as UTF8 if it outputs normally? – Luke Jul 13 '16 at 14:32
  • 1
    `ee` and `e9` are latin1 encodings. `Cxxx` are utf8 encodings. – Rick James Jul 13 '16 at 21:41
  • 1
    @Martin -- _Outside_ MySQL, "UTF-8" includes _all_ characters. _Inside_ MySQL, "utf8" refers to the 3-byte subset that includes most of Chinese (and most of the rest of the world); "utf8mb4" includes all encodings through 4 bytes, which picks up the rest of Chinese. (There are no 5-byte encodings defined yet.) – Rick James Jul 13 '16 at 21:50
  • Thank you guys, I really am learning a lot here from you all. – Luke Jul 13 '16 at 22:21

3 Answers3

2

For Chinese, you need to tell MySQL to use utf8mb4, not just utf8.

When trying to use utf8/utf8mb4, if you see Question Marks (regular ones, not black diamonds) (? is hex 3F),

  • The bytes to be stored are not encoded as utf8. Fix this.
  • The column in the database is CHARACTER SET utf8mb4. Fix this.
  • Also, check that the connection during reading is utf8mb4.

新浪新闻 is Mojibake for 新浪新闻

When trying to use utf8/utf8mb4, if you see Mojibake, check the following. This discussion also applies to Double Encoding, which is not necessarily visible.

  • The bytes to be stored need to be utf8-encoded.
  • The connection when INSERTing and SELECTing text needs to specify utf8mb4. (set_charset)
  • The column needs to be declared CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
  • HTML should start with <meta charset=UTF-8>. (You did this.)

To verify, do SELECT col, HEX(col) FROM .... If the hex output for is E696B0, then it is correctly encoded with utf8/utf8mb4. If you get C3A6E28093C2B0, it is "double-encoded". Usually if the hex begins with E or F, it is probably correctly encoded. Also, the hex for a single Chinese character will be 6 or 8 long in all cases. Reference.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thank you for this information. I have a question for you... If I connect to my database using a UTF8 connection and all of the characters output correctly, can I assume that the existing data in my database is encoded correctly as UTF8 and not latin1 encoded as UTF8 as I suspected? – Luke Jul 13 '16 at 21:22
  • `SELECTing` can deceive. I added a technique using HEX to my answer. – Rick James Jul 13 '16 at 21:36
  • Thank you so much. I did a more simple test and submitted the letter `é` then I ran HEX() on it and it came out as `C3A9`. Am i correct in saying that if it stored it like that, then it looks like I'm connecting to my database over UTF8 anyway? – Luke Jul 13 '16 at 22:15
  • The reason that I can't try with the Chinese characters is that they store as `3F` at the moment. – Luke Jul 13 '16 at 22:21
  • Yes and no. `C3A9` confirms that `é` was correctly stored as utf8/utf8mb4. However, if you _connect_ to the database incorrectly and/or _display_ it incorrectly, you might _see_ `?` or `�` or `é`. The `HEX` test confirms that the `INSERT` worked correctly. After that, one can focus on getting the `SELECT` (etc) working correctly. That Reference link is in the middle of a huge, rambling, attempt to make sense of character set (and collation) problems in mysql. – Rick James Jul 13 '16 at 22:25
  • And most of my answer is about half of the much-condensed document I am attempt to solidify. In addition to Question marks and Mojibake, there are Black Diamonds and Truncated. These _seem_ to be the 4 things that one _sees_. From there I deduce what the user did wrong to get that output. – Rick James Jul 13 '16 at 22:31
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/117263/discussion-between-luke-and-rick-james). – Luke Jul 13 '16 at 23:28
0

try to use this :

// Display the data echo "Result: " . utf8_encode($row['questiontext']) . '
';

;)

AndreLuisRomero
  • 67
  • 2
  • 10
0

application->mysql: select HEX('中国')
mysql->application: select UNHEX('E4B8ADE59BBD') you can use the MySQL fuction

Zed
  • 23
  • 1
  • 7