Latin1 Connection, instead of UTF8
I've recently had reports from users of my website that they are unable to create content in Chinese or Arabic. I tried creating content with Chinese characters myself and found that each of these characters is stored as a question mark (?).
I realised from other questions and articles that I've read that I'm probably in 'character set hell'.
It seems that I've been connecting to the database through Entity Framework over a Latin1 connection, as this is the default for MySQL, but the columns in my database are UTF8.
How is my data encoded?
From what I have read of the process, I think that the data in my database is UTF8 data from the browser, treated as Latin1 during transit to the database, and then encoded again as UTF8 just before it is stored in the database.
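A minimal sketch of what that suspected double encoding would do to the stored bytes, assuming ISO-8859-1 is a close enough stand-in for MySQL's latin1 and using a made-up sample string:

```php
<?php
// Hypothetical sample: "île" as UTF-8 bytes (c3 ae 6c 65).
$original = "\xC3\xAEle";

// Simulate the suspected mistake: the UTF-8 bytes are treated as Latin1
// and converted to UTF-8 a second time.
// Assumption: ISO-8859-1 stands in for MySQL's latin1 here.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), "\n";      // c3ae6c65
echo bin2hex($doubleEncoded), "\n"; // c383c2ae6c65
```

If this theory were right, an accented character such as î would show up in the stored bytes as c383c2ae rather than c3ae.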
I have been using a PHP script to check whether the data comes out correctly when I set my connection to use UTF8. When I select data from my database, there doesn't appear to be any difference.
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
</head>
<body>
<?php
//
// Make the connection to the database
//
$link = mysqli_connect('localhost','root', '', 'mydatabase');
if (!$link) {
die('Could not connect to MySQL: ' . mysqli_connect_error());
}
// Set connection character set to UTF8
$link->set_charset('utf8');
echo '<p>Connection OK</p>';
//
// Request the string from the database
//
$result = $link->query("SELECT questiontext FROM question WHERE id = 101");
$row = $result->fetch_assoc();
// Display the data
echo "Result: " . $row['questiontext'] . '<br/>';
mysqli_close($link);
?>
</body>
</html>
- If I use this PHP script to request existing data that looks like plain [A-Z 0-9] characters over a UTF8 connection, it displays on the screen just as it did over the Latin1 connection when using Entity Framework. I can't tell the difference, and there doesn't appear to be any issue with the data.
- If I request a string of Chinese characters that appear to have been entered into the database as question marks, it displays as question marks when retrieved over a UTF8 connection.
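One possible explanation for the question marks, sketched under the assumption that the text was converted to Latin1 somewhere on the way in (ISO-8859-1 stands in for MySQL's latin1, and the sample string is made up):

```php
<?php
// Make the substitute character explicit: '?' (0x3F).
mb_substitute_character(0x3F);

// Hypothetical sample: two Chinese characters, UTF-8 encoded.
$chinese = "\xE4\xB8\xAD\xE6\x96\x87"; // 中文

// Latin1 has no Chinese characters, so each one is replaced on conversion.
// Assumption: ISO-8859-1 stands in for MySQL's latin1.
$asLatin1 = mb_convert_encoding($chinese, 'ISO-8859-1', 'UTF-8');

echo $asLatin1, "\n";          // ??
echo bin2hex($asLatin1), "\n"; // 3f3f
```

If the stored bytes really are 3f, they are literal question marks, and reading them back over a UTF8 connection cannot recover the original characters.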
I expected that when I connected to the database using a UTF8 connection, the data would be displayed as garbage, since I was previously using a Latin1 connection, but it isn't.
I used Entity Framework to interrogate the MySQL variables that were being used for the connection before and after adding CharSet=utf8; to my connection string. Hopefully this gives you an idea of how the connection was being established before and how it is now:
Connection before:
Connection with connection string charset updated:
How can I determine whether the data in the database is encoded incorrectly, i.e. whether it is Latin1 data encoded as UTF8, so that I can decide whether I can just change my connection string to use UTF8 and have everything work OK?
Update
I've been experimenting by switching the connection type between UTF8 and Latin1, and these are my findings...
If I set my connection type to latin1 and output the characters, I end up getting something like this:
Tu es dans une �le d�serte
HEX (bin2hex): 54752065732064616e7320756e6520 ee 6c652064 e9 7365727465203a
If I set my connection to utf8:
Tu es dans une île déserte
HEX (bin2hex): 54752065732064616e7320756e6520 c3ae 6c652064 c3a9 7365727465203a
(bold and spacing added by Rick James)
When using a UTF8 connection, there aren't any dodgy-looking characters at all; they only appear when I set my connection type to latin1. This leads me to believe that the encoding of my data is OK and that it's presumably just straightforward UTF8.
I can only conclude from this that Entity Framework has been communicating over a UTF8 connection all along, but I don't know how I can confirm that the data is stored correctly.
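As a rough way to check hex dumps like the ones above, here is a heuristic sketch. The helper name classifyBytes is hypothetical, ISO-8859-1 is assumed to stand in for MySQL's latin1, and the result is only a guess about the encoding, not proof:

```php
<?php
// Hypothetical helper: guess how a byte string is encoded.
function classifyBytes(string $bytes): string
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        return 'not valid UTF-8 (possibly Latin1)';
    }
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'plain ASCII';
    }
    // Strip one UTF-8 layer. If what remains is still valid multi-byte
    // UTF-8, the text was probably encoded twice.
    // Assumption: ISO-8859-1 stands in for MySQL's latin1.
    $decoded = mb_convert_encoding($bytes, 'ISO-8859-1', 'UTF-8');
    if (preg_match('/[\x80-\xFF]/', $decoded) && mb_check_encoding($decoded, 'UTF-8')) {
        return 'probably double-encoded UTF-8';
    }
    return 'UTF-8';
}

// Hex taken from the two outputs above (spacing removed).
$viaLatin1 = hex2bin('54752065732064616e7320756e6520ee6c652064e97365727465203a');
$viaUtf8   = hex2bin('54752065732064616e7320756e6520c3ae6c652064c3a97365727465203a');

echo classifyBytes($viaLatin1), "\n"; // not valid UTF-8 (possibly Latin1)
echo classifyBytes($viaUtf8), "\n";   // UTF-8
```

The same function could be fed the output of MySQL's HEX() on a column (via hex2bin) to inspect the stored bytes directly instead of what the connection returns. Under these assumptions, c3ae for î over the utf8 connection points to single-encoded UTF-8, whereas double-encoded data would appear as c383c2ae.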