1

I have old database with valuable data which was in latin1_swedish_ci collation. I am now trying to convert it to utf8 but I have problems.

I have tried everything and read every article I found. I have several databases and some of this techniques were effective to some database, but are not to this one.

First of all I tried this, which worked on different database

UPDATE tt_news SET 
    title=convert(cast(convert(title using  latin1) as binary) using utf8), 
    short=convert(cast(convert(short using  latin1) as binary) using utf8), 
    bodytext=convert(cast(convert(bodytext using  latin1) as binary) using utf8)
WHERE 1

and it is truncating my data from Balaševiæevi to Bala

Then I tried converting database using this

 mysqldump -u root -p mydb -c -e --default-character-set=utf8 --single-transaction --skip-set-charset --add-drop-database -B mydb  > dump.sql

sed 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci/' <dump.sql | sed 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' >dump-fixed.sql

mysql -u root -p mydb < dump-fixed.sql

Everything is converted to utf8_general_ci, and data is not truncated but I still have problems with it because characters are not showing as they should.

emir
  • 1,336
  • 1
  • 14
  • 33

1 Answers1

1

There are multiple ways to convert/fix/etc character sets. You picked the wrong way.

ALTER TABLE ... CONVERT TO ...

is the shortest technique to change a correctly encoded table from one encoding to another.

Reference

If, instead, you had "double-encoding" in the columns, something else would be needed.

The hex for š is

9A in latin1
C5A1 in utf8
C385C2A1 if "double-encoded"

More discussion

Community
  • 1
  • 1
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • this is word->Kuæa, this is hex->4B75C3A661, this is binary->4b75c3a661. What to do? – emir Mar 12 '17 at 12:12
  • There should be Kuća – emir Mar 12 '17 at 18:23
  • Kuæa and 4B75C3A661 and 4b75c3a661 are consistent utf8. (The capitalization of hex is not consequential.) Utf8 hex for Kuća is `4B75C48761`. You mentioned "truncation" -- that is also discussed in the link. – Rick James Mar 13 '17 at 05:18
  • I have used some different tools to convert it all to utf8 and to avoid truncation, and this is what I got at the end. is there a way to replace all occurrence of some word in table? – emir Mar 13 '17 at 09:01
  • `UPDATE tbl SET col = 'new' WHERE col = 'old';` changes all cases of the _column_ `col` being 'old' to 'new'. Is that what you are asking? – Rick James Mar 13 '17 at 21:31
  • I want to change one word which is inside any sentence. For example change any occurrence of kuæa to kuća, no matter is it on alone, in the middle of sentence or at the end. I don't want to replace whole content with new content, just to replace faulty world. – emir Mar 14 '17 at 23:57
  • That involves messy code using `SUBSTRING_INDEX()`. Think about doing it in your application language instead. – Rick James Mar 15 '17 at 02:09