0

I've been using for a long time a database/connection with the wrong encoding, resulting the hebrew language characters in the database to display as unknown-language characters, as the example shows below:

enter image description here

I want to re-import/change the database with the inserted-wrong-encoded characters to the right encoded characters, so the hebrew characters will be displayed as hebrew characters and not as unknown parse like *"× ×תה מסכי×,×× ×©×™× ×ž×¦×™×¢×™× ×œ×™ כמה ×”× "*

For the record, when I display this unknown characters sql data with php - it shows as hebrew. when I'm trying to access it from the phpMyAdmin Panel - it shows as jibrish (these unknown characters).

Is there any way to fix it although there is some data already inserted in the database?

David Mulder
  • 26,123
  • 9
  • 51
  • 114
Eliran
  • 207
  • 1
  • 2
  • 11

2 Answers2

1

That feels like "double-encoded" Hebrew strings.

This partially recovers the text:

UNHEX(HEX(CONVERT('× ×תה מסכי×,××' USING latin1))) 
-->  '� �תה מסכי�,�� 

I do not know what leads to the symbols.

Please do SELECT col, HEX(col) FROM ... WHERE ...; for some cell. I would expect שלום to give hex D7A9D79CD795D79D if it were correctly stored. For "double encoding", I would expect C397C2A9C397C593C397E280A2C397C29D.

Please provide the output from that SELECT, then I will work on how to recover the data.

Edit

Here's what I think happened.

  • The client had characters encoded as utf8; and
  • SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
  • The column in the table declared CHARACTER SET utf8.

Yod did not jump out as a letter, so it took a while to see it. CONVERT(BINARY(CONVERT('×™×™123' USING latin1)) USING utf8) -->יי123

So, I am thinking that that expression will clean up the text. But be cautious; try it on a few rows before 'fixing' the entire table.

UPDATE table SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8) WHERE ...;

If that does not work, here are 4 fixes for double-encoding that may or may not be equivalent. (Note: BINARY(xx) is probably the same as CONVERT(xx USING binary).)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Probably the copy/paste omits some undisplayable characters. Without a hex dump of the actual bytes, it's really hard to reliably reconstruct the input. This is a really old question, but the OP should perhaps consult the [`character-encoding` tag wiki](http://stackoverflow.com/tags/character-encoding/info) for how to include useful diagnostics. – tripleee Jun 25 '15 at 05:40
  • Let's hope that there are some unprintables. Then we can see if it is a straight-forward double-encoding, at which point a solution can be presented. – Rick James Jun 26 '15 at 00:56
  • Ok. I've been using the above HEX query provided by you on some of the hebrew cells - and this is the output: http://s24.postimg.org/qzc0f1y11/cells.png - seem like its double encoded. This database is very big (about 100-200mb) and I am really desperate of finding a way to make this data normal encoded. – Eliran Jun 26 '15 at 17:32
  • This is working perfectly! Thank you very much, now I can use my wrong-encoded databses properly. I appreciate your help and great solution. – Eliran Jun 29 '15 at 19:24
  • I should have emphasize doing `SET NAMES utf8;` (or equivalent, such as `mysqli_set_charset('utf8')`) to prevent future `INSERTs` from messing up. – Rick James Jul 02 '15 at 19:52
0

I am not sure that you can do anything about the data that has already been stored in the database. However, you can import hebrew data properly by making sure you have the correct character set and collation.

  1. the db collation has to be utf8_general_ci
  2. the collation of the table with hebrew has to be utf8_general_ci

for example:

CREATE DATABASE col CHARACTER SET utf8 COLLATE utf8_general_ci;

CREATE TABLE `col`.`hebrew` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `heb` VARCHAR(45) NOT NULL,
  PRIMARY KEY (`id`)
  ) CHARACTER SET utf8
  COLLATE utf8_general_ci;

INSERT INTO hebrew(heb) values ('שלום');

enter image description here

BK435
  • 3,076
  • 3
  • 19
  • 27
  • Sadly i have 100MB~ database with already-stored data as shown above (the unknown characters) and I'm looking for a way to fix it and convert it back. – Eliran Jun 13 '15 at 08:29
  • Once you have identified the current encoding properly you can convert your database. http://stackoverflow.com/questions/6115612/how-to-convert-an-entire-mysql-database-characterset-and-collation-to-utf-8 is a starting point but given your predicament, see the linked questions and tools. Maybe you have to export to a text dump, convert the text, create an empty database with the correct settings, and import the converted text. – tripleee Jun 26 '15 at 03:13