0

I have some problem with displaying Russian data from my MySQL database. Background: Some other programmer created the database and I am responsible for it now.

My problem: On a HTML page a list of statements from my database is outputted. The statements are all Russians. Out of all 80 statements, there are 6 statements which at their ends are outputted with a black icon with a white question mark in it (�).

The common thing (!!!) is that the question mark always appears at the end of the statement and the statements are the longest with around 80 chars.

Here comes a list of some statements on my HTML page:

  • Я действую скорее активно, агрессивно и, думая о своей выгоде
  • Как интровертный своенравный человек, который демонстрирует мало собственной ин�
  • Являешься убедительным, располагающим и целеустремленным
  • Я демонстрирую разум и ясность, объективность и логику
  • Кого-то, кто самостоятельно мыслит и знает, как себе помочь
  • То мне следовало бы сдержанно реагировать и найти другой путь, чтобы получить бол�

I had a look into the database into my table "statements" and found that the statements are saved as inidentifable chars. For example, the saved entry for the second russian statement above (which contains the question mark symbol at the end) is the following:

Я дейÑтвую Ñкорее доверительно и оказываю Ñвое доверие также другим, которые к неÐ

About the table: Its collation is utf8_unicode_ci with the fields having VARCHAR(150) as data type.

My questions:

  1. Why are the statements saved with these unidentifable letters in my table? How can I convert them to Russian?
  2. What is the reason for the question mark symbol in the html output? How can I get rid of it?

EDIT:

In the same project I found a further problem which I describe in a separate post. Maybe this problem has to do with it?

PHP: Special char becomes question mark icon in substring

Community
  • 1
  • 1
Max
  • 832
  • 1
  • 14
  • 33
  • "statements are saved as inidentifable chars." - are you sure it's not a display problem? The data is stored in UTF8, but it seems that it's displayed to you in CP1251 by whatever tool you are using.. – Timekiller Jan 23 '16 at 08:38
  • Yeah, could be a display problem. How can I get rid of the problem? – Max Jan 23 '16 at 08:39
  • and seems you need to utilize utf8 collation for your table as well.. – ameenulla0007 Jan 23 '16 at 08:41
  • @ameenulla0007 The table already has utf8_unicode_ci collation, as stated in my post. What do you mean? – Max Jan 23 '16 at 08:47
  • What version of MySQL? Please provide `SHOW CREATE TABLE`. – Rick James Nov 12 '18 at 17:18
  • The black diamond usually comes from the browser. Please provide `SELECT HEX(col), col ...` for one of the bad rows. – Rick James Nov 12 '18 at 17:20

4 Answers4

1

utilize 'SET names=utf8' query while insertion, the problem is, while inputting (insertion), data is converted into some special characters.

so if you use this, the data what you input in russian language, the same data will be added into database too. and while retrieval you will get the result same as you inputted.

ameenulla0007
  • 2,663
  • 1
  • 12
  • 15
0

use mbstring can change encoding http://php.net/manual/en/book.mbstring.php

0

It sounds like a simple truncation problem. Varchar(150) says field is maximum 150 bytes. UTF8 can use more than one byte per symbol - for example, each letter in cyrillic will use 2 bytes, while space or comma symbols will use 1 byte. So if the string is more than 150 bytes long, it's possible that the cyrillic letter is truncated in the middle. For example, in your second sentence, small и has utf8 code d0 b8, but was truncated to d0 which is non-printable symbol, resulting in ?s you see. Nothing you can do about it, the data is already lost. You can only prettify the display by removing standalone byte in the range of C2..DF from the end of the string.

As for unidentified letters you see, there's a lot of factors. Database encoding, connection encoding, table collation and display encoding if you're using a web interface all contribute to the mess, there might be multiple re-encodings to the string before you see it. It's also possible that the data is recoded in some way on insert and decoded before display - not the worst thing I saw in legacy code, really. You'll have to experiment and find the proper combination yourself.

Timekiller
  • 2,946
  • 2
  • 16
  • 16
  • Hello Timekiller, thank you very much for your answer. What would be quick measures to fix things a little bit? Increasing varchar for future inserts? How can I remove standalone byte from the end of the string in my case? – Max Jan 23 '16 at 09:24
  • Yes, increase field size, then take a look at `mb_strcut()` function - according to the manual it allows to cut the string to N bytes without breaking multibyte character sequence, that's just what you need. For replacing invalid utf8 chars, take a look at http://stackoverflow.com/a/13695364/1375470 - though in your particular case replacing regexp `[\xC2-\xDF]$` with empty string might suffice. – Timekiller Jan 23 '16 at 09:49
  • I just found out that varchar(150) is not enough. I will increase it to 300. Might there be negative implications when increasing it for every field? – Max Jan 23 '16 at 10:19
  • `VARCHAR(150)`, in MySQL, has a limit of 150 _characters_. – Rick James Nov 12 '18 at 17:03
  • Two 'wrongs' won't make a 'right'. Did you find that 300 did not make these strings any longer? – Rick James Nov 12 '18 at 17:43
0

No. I think something else is truncating to 150 bytes. Then what comes to MySQL is an invalid utf8 string, ending in half of a Cyrillic character (Ð, hex D0).

  1. This previous process grumbled about the broken utf8 character by turning it into a black diamond (hex EFBFBD � "REPLACEMENT CHARACTER"). This was chopping to 150 bytes.
  2. When the INSERT into MySQL got it, the 150th byte (hex D0 had been replaced by 3 bytes (EFBFBD) and stored that way in the table.
  3. Note that the table has 152 bytes in it -- see SELECT LENGTH(col), hex(col) ... to 'prove' it. That is, MySQL did not truncate to 150 bytes.
  4. The SELECT faithfully fetched what was stored.
  5. And you displayed such.

So, look at the path taken by the text before getting to the MySQL INSERT.

(I have been tackling charset problems for years; I thought I had seen everything. This was a new one for me!)

Rick James
  • 135,179
  • 13
  • 127
  • 222