I'm trying to change my encoding to utf-8, below is what I have so far.

Table Charset

[image: my table]

mbstring installed

extension=php_mbstring.dll

mbstring configured in php.ini

mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On  ; later changed to mbstring.encoding_translation = 0
mbstring.http_input = auto          ; later changed to mbstring.http_input = pass
mbstring.http_output = UTF-8        ; later changed to mbstring.http_output = pass
mbstring.detect_order = auto  
mbstring.substitute_character
default_charset = UTF-8
mbstring.func_overload = 7

Header

header('Content-type: text/html; charset=UTF-8');

HTML meta tag

<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />

HTML CODE

<label for="article_body_pun">Article (Foreign): </label>
<textarea cols="100" rows="10" name="article_body_pun"></textarea><br />

PHP

$article_body_pun   = $_REQUEST['article_body_pun'];

SQL

$insert_article = "INSERT INTO articles(article_body_pun) 
                      VALUES ('{$article_body_pun}')";

PHP to insert

$article_query = mysqli_query($connectDB, $insert_article);
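
For comparison, here is a minimal sketch of the same insert with the connection charset set explicitly and the value bound through a prepared statement rather than interpolated into the SQL string. The function name `insert_article` and the default charset `utf8mb4` are assumptions for illustration; the table and column names are taken from the code above.

```php
<?php
// Hypothetical helper (not from the original post): inserts one article body
// over a connection whose charset is set explicitly.
function insert_article(mysqli $connectDB, string $article_body_pun, string $charset = 'utf8mb4'): bool
{
    // Tell MySQL which encoding the client sends and expects.
    if (!mysqli_set_charset($connectDB, $charset)) {
        return false;
    }

    // The placeholder keeps the submitted text out of the SQL string itself.
    $stmt = mysqli_prepare($connectDB, 'INSERT INTO articles (article_body_pun) VALUES (?)');
    if ($stmt === false) {
        return false;
    }

    mysqli_stmt_bind_param($stmt, 's', $article_body_pun);
    $ok = mysqli_stmt_execute($stmt);
    mysqli_stmt_close($stmt);

    return $ok;
}
```

It would be called as `insert_article($connectDB, $_REQUEST['article_body_pun'])`. Binding the value also avoids the quoting problems that come with building the query by string interpolation.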

Data that should be stored -> 汉语

Original Data stored

汉è¯Â

Upon adding mysqli_set_charset($connectDB, "utf8"); as suggested by @Pekka 웃, output became (commented below as well)

æ±è¯
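
One way to see where output like this can come from: if the six UTF-8 bytes of 汉语 are treated as ISO-8859-1 text and converted to UTF-8 a second time, each byte turns into its own character. The sketch below reproduces that with `mb_convert_encoding`. The helper names are made up for illustration, and this is only one plausible mechanism, not a confirmed diagnosis of the setup described here.

```php
<?php
// "汉语" written as raw UTF-8 bytes so the example does not depend on
// how this source file itself is saved.
$original = "\xE6\xB1\x89\xE8\xAF\xAD";

// Hypothetical helper: misreads UTF-8 bytes as ISO-8859-1 and encodes them as UTF-8 again.
function double_encode(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: undoes one round of the misinterpretation above.
function undo_double_encode(string $garbled): string
{
    return mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
}

$garbled = double_encode($original);

echo bin2hex($original), "\n"; // e6b189e8afad
echo bin2hex($garbled), "\n";  // c3a6c2b1c289c3a8c2afc2ad
echo undo_double_encode($garbled) === $original ? "round trip ok\n" : "round trip failed\n";
```

In ISO-8859-1 the bytes 0x89 and 0xAD are a control character and a soft hyphen, both usually invisible, which would be consistent with only four visible characters showing up in the garbled output.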

After some troubleshooting, the data was stored partially correctly:

�?语

I checked the encoding of the retrieved data with mb_detect_encoding, which reports UTF-8.

I also checked the page's character encoding in Firefox.

[screenshot: character encoding shown in Firefox]

That seems to be correct, but I'm still getting question marks for some characters. Any further suggestions to make this work?

vephelp
  • +1 for a complete, well-structured question! You are missing some things though, namely setting the connection encoding to UTF-8. Check out [UTF-8 all the way through](http://stackoverflow.com/q/279170) – Pekka Dec 05 '13 at 01:43
  • @Pekka웃, any suggestion with the changes above? – vephelp Dec 06 '13 at 00:44

2 Answers

You're nearly there: make sure the MySQL connection is also encoded as UTF-8.

Check out [UTF-8 all the way through](http://stackoverflow.com/q/279170) for details.

Pekka
  • Before, the stored text looked like `汉è¯Â`. After adding `mysqli_set_charset($connectDB, "utf8");` I'm getting `æ±è¯`, when it should look like `汉语`. – vephelp Dec 05 '13 at 01:52
  • Hmmm, that's strange. The page that contains the form is UTF-8 encoded and it shows so in the browser, correct? – Pekka Dec 05 '13 at 01:56
  • in my database actually, using phpmyadmin. – vephelp Dec 05 '13 at 01:57
  • @vephelp Is it possible that your phpMyAdmin page simply doesn't render the database content correctly? Can you try checking the database content outside of phpMyAdmin? – Willem Dec 06 '13 at 00:36
  • @Willem yes, currently I'm retrieving the data into the browser itself, and getting the output included in my recent edit above. – vephelp Dec 06 '13 at 00:38
  • Have you read: http://stackoverflow.com/questions/279170/utf-8-all-the-way-through#comment-22638702 ? It may account for your, as you stated, partially correct results – Willem Dec 06 '13 at 00:50
  • @Pekka웃, have added `` in header of the output page. So I guess that is UTF-8 now. Not sure though. – vephelp Dec 06 '13 at 00:56
  • @vephelp Do you have a live link? – Pekka Dec 06 '13 at 00:57
  • 汉 in Unicode is 6C49, 4 bytes. According to the comment in my earlier link, the 'normal' utf-8 encoding used in MySQL stores Unicode characters up to 3 bytes. Consider trying the https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html MySQL character encoding. – Willem Dec 06 '13 at 01:41
  • @Willem Just to confirm, isn't 语 4 bytes as well? unicode value is 8BED. If yes, then why is that getting stored properly? – vephelp Dec 06 '13 at 15:33
  • @Willem, I'm quite new at this but just read that utf8 is actually 8 bytes. So I guess that shouldn't be an issue. I guess, haha! – vephelp Dec 07 '13 at 15:46
  • and that I guess is for the PHP part, but mysql is 3bytes haha! nice info! – vephelp Dec 07 '13 at 16:00
  • @vephelp nope, UTF-8 is 1 to 4 bytes per character. MySQL draws the line at 3 bytes though – Pekka Dec 07 '13 at 16:33
  • ow ok, @Pekka웃 so I will have to use utf8mb4 charset for my database? But what I've read around is all about utf8, they say it works fine. – vephelp Dec 07 '13 at 17:06
  • @vephelp yeah, in theory mySQL's UTF-8 supports all characters in the so-called "basic multilingual plane" which `汉` should be in. Not sure what is going on here – Pekka Dec 07 '13 at 17:20
  • @vephelp your character is 3 bytes and should be ok: `E6 B1 89` http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%B1%89 – Pekka Dec 07 '13 at 18:02
  • mhhm. I'm sort of unsure now on what else to do. – vephelp Dec 07 '13 at 19:07
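
The byte counts discussed in these comments can be checked directly. The following sketch assumes PHP 7.2+ with mbstring (for `mb_ord`) and uses a made-up helper name, `describe_char`; it reports the code point and the UTF-8 bytes of a single character.

```php
<?php
// Hypothetical helper: returns code point, byte count and hex bytes of one UTF-8 character.
function describe_char(string $char): array
{
    return [
        'codepoint' => sprintf('U+%04X', mb_ord($char, 'UTF-8')),
        // '8bit' counts raw bytes even if mbstring.func_overload is active.
        'bytes'     => mb_strlen($char, '8bit'),
        'hex'       => implode(' ', str_split(bin2hex($char), 2)),
    ];
}

// Characters given as raw bytes: 汉, 语, and an emoji outside the basic multilingual plane.
foreach (["\xE6\xB1\x89", "\xE8\xAF\xAD", "\xF0\x9F\x98\x80"] as $char) {
    $info = describe_char($char);
    echo $info['codepoint'], ': ', $info['bytes'], ' bytes (', $info['hex'], ")\n";
}
```

Both characters from the question take three bytes in UTF-8, so they fit within MySQL's three-byte `utf8`; only characters like the emoji, which need four bytes, require `utf8mb4`.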

I was able to fix the problem with the help of a friend. The data was not being passed correctly from my HTML form to the database. My mbstring configuration seems to have been causing the problem; I had to update the following:

mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = 0

With these values reverted to their defaults, it worked perfectly.
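
To confirm which values PHP is actually running with after editing php.ini and restarting the web server, the settings can be read back with `ini_get`. This is a small sketch with a made-up function name; note that `ini_get` returns `false` for a directive the running PHP version does not know.

```php
<?php
// Hypothetical helper: reads back the three directives changed above.
function mbstring_settings(): array
{
    $settings = [];
    foreach (['mbstring.http_input', 'mbstring.http_output', 'mbstring.encoding_translation'] as $name) {
        // ini_get() returns the current value as a string, or false if the directive is unknown.
        $settings[$name] = ini_get($name);
    }
    return $settings;
}

var_dump(mbstring_settings());
```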

Thanks to @Pekka 웃 and @Willem for their help.

vephelp