Proper Charset to work with Vietnamese Characters (that isn't Unicode) in PHP

Question

I've searched around for a while and haven't yet found something that'll work for me. I am using a PHP form to submit data into SAP using the SAP DI API. I need to figure out which character set will actually allow me to store and work with Vietnamese characters.

UTF8 seems to work for a lot of the characters but ô becomes Ã´. More importantly, there are character limits, and UTF-8 breaks character limits. If I have a string of 30 characters it tells the API that it's more than 50. The same is true for storing in MySQL--if there's a varchar character limit, UTF-8 causes the string to go above it.

Unfortunately, when I search, UTF-8 seems to be the only thing people suggest for Vietnamese characters. If I don't encode the characters at all, they get stored as their html character codes. I've also tried ISO-8859-1, converting into UCS-2 or UCS-4... I'm really at a loss. If anyone has experience working with vietnamese characters, your help would be greatly appreciated.

UPDATE

It appears the issue may be with my wampserver on Windows. here's a bit of code that is confusing me:

$str = 'VậTCôNG';
$str1 = utf8_encode($str);
if (mb_detect_encoding($str,"UTF-8",true) == true) {
    print_r('yes');
    if ($str1 == $str) {
        print_r('yes2');
    }
}
echo $str . $str1;

This prints "yes" but not "yes2", and $str.str1 = "VậTCôNGVáºTCÃ´NG" in the browser.

I have my php.ini file with:

default_charset = "utf-8"

and my httpd.conf file with:

AddDefaultCharset UTF-8

and my php file I'm running has:

header("Content-type: text/html; charset=utf-8");

So I'm now wondering: if the original string was utf-8, why wouldn't it equal a utf8 encoding of itself? and why is the utf8 encoding returning wrong characters? Is something wrong in the wampserver configurations?

UTF-8 is the way you want to go in the end, there is no serious alternative. And the UTF-8 character set definitely contains vietnamese characters, the fact that they get "changed" has to be some local issue with your set. However you have to understand how UTF-8 encoding actually works to understand those changes in string length. — arkascha, Feb 21 '17 at 19:36
@arkascha thanks for the response. My problem with UTF-8 is if I have a hard character limit of 50 characters for the SAP DI API and the string is 32 with several Vietnamese characters it'll go over the limit and not enter. This seems like a dealbreaker, even if I do fix the character set problem. — Wan, Feb 21 '17 at 21:19
@arkascha forget this response. I think you're right. I updated my post, do you have any insight on why this is happening? Or what local issue could be occurirng with my set? — Wan, Feb 23 '17 at 19:02
Several issues here: 1. you assume that X===utf8_encode(X), but that certainly is _not_ true. Where did you get the idea that _should be_ true? 2. character encoding detection is a tricky thing. For a given string it might deliver a correct response, it might also fail. That is actually specifically mentioned in the documentation. Simple example: _any_ string can correctly be interpreted as 8bit encoded. Does that mean it is? No! All that is possible is to definitely detect that _some_ strings are certainly _not_ valid utf8 encoded strings. 3. stop using your browser for testing, use CLI. — arkascha, Feb 23 '17 at 19:28
This old post might be of interest for you: http://stackoverflow.com/questions/279170/utf-8-all-the-way-through — arkascha, Feb 23 '17 at 19:29
[`utf8_encode`](http://php.net/manual/en/function.utf8-encode.php) converts from the ISO-8859-1 so explain why you use it. See also [`html_entity_decode`](http://php.net/manual/en/function.html-entity-decode.php) — Deadooshka, Feb 23 '17 at 19:31
@arkascha thanks for the help, I'm still getting the hang of this. So, I think my string is already in UTF-8 but I don't know how to be sure. When I pass it to the DI API I get " VáºTCÃ´NG" or something like that, which is the same as what happens when I run utf8_encode on the string. I'm now testing on the command line and getting Vß║¡TC├┤NG when it's running. Should the command prompt be able to read these characters or is this indicative of the problem? — Wan, Feb 23 '17 at 20:46
Actually MS-Windows is well known to have issues with unicode, sorry. Only one of the areas where it cannot hide its closed and limited origin. You can test your original string yourself by using a `hexeditor`. That is (in my eyes) the only tool that really allows to peek into what a file _really_ contains. You can spot the multi byte sequences in there if you understood how UTF encoding actually works. All APIs, all browsers only add layers between you and the data, layers trying to be smart and "fixing" things which often actually fails... — arkascha, Feb 23 '17 at 21:02
`mb_detect_encoding($str, 'UTF-8', true)` returns the string `'UTF-8'`. Vietnamese locale is `windows-1258` / `CP1258`. — Deadooshka, Feb 24 '17 at 15:49

score 0 · Answer 1 · edited May 23 '17 at 11:46

Ã´ is the "Mojibake" for ô. That is, you do have UTF-8, but something in the code mangled it.

See Trouble with utf8 characters; what I see is not what I stored and search for Mojibake. It says to check these:

The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.

It is possible to recover the data in the database, but it depends on details not yet provided.

http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

Each Vietnamese character take 2-3 bytes for encoding in UTF-8. It is unclear whether the "hard 50" is really a character limit or a byte limit.

If you happen to have Mojibake's sibling "double encoding", then a Vietnamese character will take 4-6 bytes and feel like 2-3 characters. See "Test the data" in the first link.

An example of how to 'undo' Mobibake in MySQL: CONVERT(BINARY(CONVERT('VáºTCÃ´NG' USING latin1)) USING utf8mb4) --> 'VậTCôNG'

"Double encoding" is sort of like Mojibake twice. That is one side treats it as latin1, the other as UTF-8, but twice.

VậTCôNG, as UTF-8, is hex 56e1baad5443c3b44e47. If that hex is treated as character set cp850 or keybcs2, the string is Vß║¡TC├┤NG.

Hi @Rick James, I updated my post to convey my current case. Is mojibake the same as double encoding? Unfortunately I'm not using data in a database (just feeding it a string directly in PHP for now) so I'm not sure how to test. If the SAP DI API is turning my characters into mojibake does that mean it's doing encoding itself? It seems as though the import of characters by the API is having the same effect as running utf8_encode on them i.e. both return VáºTCÃ´NG. do you have any idea on what to do in this situation? — Wan, Feb 23 '17 at 19:00
Double encoding is kinda Mojibake twice. I added to my answer. Sorry, I don't know how to deal with it purely in PHP. It took me a long time to figure out that and 4 other error cases in MySQL. — Rick James, Feb 23 '17 at 22:09

score -2 · Answer 2 · answered Feb 21 '17 at 19:52

-2

Change it to VISCII.

Input: ô 
Output: ô

You can test it at Charset converter.

answered Feb 21 '17 at 19:52

r0xette

898
3
11
24

Proper Charset to work with Vietnamese Characters (that isn't Unicode) in PHP

2 Answers2