Managing Korean multi-byte string with mb_substr() produces gibberish

Question

I have a UTF-8 encoded string containing Korean (multi-byte) text. When I call mb_substr() on it, the function fails to treat it as multi-byte and behaves like substr(), so I end up with gibberish characters such as "�" at the end of the string.

星期三大象键盘开裂青蛙混杂纪念碑问题面包车斑马线 수요일 코끼리 키보드 개구리 뒤범벅 비석 이 질문에 반 얼룩말을 크래킹

mb_detect_encoding() also reports UTF-8. Any idea where I am going wrong?

The function I am currently using is:

function cleanseData($data, $mode = false, $limit = 0) {
    if ($mode) {
        $data = (mb_strlen ( $data ) > ($limit + 3)) ? mb_substr ( $data, 0, $limit, mb_detect_encoding($data) ) . '...' : $data;
    }
    $data = utf8tohtml ( $data, true );
    return $data;
}

Could you please show some code? – Kamiccolo, Oct 07 '15 at 14:12

score 0 · Answer 1 · edited May 23 '17 at 12:02

Don't use any of the mb or utf8tohtml functions. Declare that everything is utf8 at every stage. See "UTF-8 all the way through".

� probably comes from not having utf8 characters in the first place, and using the default SET NAMES latin1 instead of SET NAMES utf8.

Could it be that your text is EUC-KR (MySQL's euckr)? Please provide the hex for a few characters; I may be able to dig further.

Also please do this to see what is in the table:

SELECT col, HEX(col) FROM tbl WHERE ...

That will give a clue as to whether the data was mangled going into the table, or mangled coming out.

Correctly encoded in utf8 (or utf8mb4), 星期三 is hex E6989F E69C9F E4B889, and 보드 개 is hex EBB3B4 EB939C 20 EAB09C (I added spaces for clarity.)
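
The same check can be made on the PHP side, to compare against what HEX(col) returns. This is only a minimal sketch: hexPerChar() is a hypothetical helper name, not something from the question or from a library, and it assumes the PHP source file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper: hex-dump a string one character at a time so the
// result can be compared with the output of SELECT HEX(col).
function hexPerChar(string $s): string
{
    // The /u modifier makes PCRE treat $s as UTF-8. preg_split() returns
    // false when $s is not valid UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // Not valid UTF-8: fall back to one undivided dump of the raw bytes.
        return strtoupper(bin2hex($s));
    }
    return strtoupper(implode(' ', array_map('bin2hex', $chars)));
}

echo hexPerChar('星期三'), "\n";   // E6989F E69C9F E4B889
echo hexPerChar('보드 개'), "\n";  // EBB3B4 EB939C 20 EAB09C
```

If the PHP string dumps to different bytes than these, the text was probably not utf8 before it ever reached the database.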

What you have is a combination of Chinese and Korean, correct? I strongly recommend utf8mb4 throughout.
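
As a rough illustration of "utf8mb4 throughout" on the PHP side, the connection and the page can both declare their encoding. This is a sketch under assumptions: the question does not say which database extension is in use, so mysqli is assumed here, and the function names and connection parameters are placeholders.

```php
<?php
// Sketch only: assumes the mysqli extension; all parameters are placeholders.
function openUtf8mb4Connection(string $host, string $user, string $pass, string $dbName): mysqli
{
    $db = new mysqli($host, $user, $pass, $dbName);
    if ($db->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $db->connect_error);
    }
    // Same effect on the server as SET NAMES utf8mb4, and it also informs
    // the client library of the connection character set.
    if (!$db->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not switch to utf8mb4: ' . $db->error);
    }
    return $db;
}

// Tell the browser that the page bytes are UTF-8. Must run before any output.
function sendUtf8Header(): void
{
    header('Content-Type: text/html; charset=utf-8');
}
```

The table columns need to be declared utf8mb4 as well; setting the connection charset alone does not convert data that is already stored.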
