After migrating from the ruby1.8/mysql gem to ruby1.9/mysql2, I get strings from the legacy DB that are reported to be UTF-8 but look as if they were encoded with latin1 (or probably have some kind of double encoding, as a straight force_encoding does not help).

String example:

Ñ„Ñ‹Ð²Ð°Ð¿Ñ€Ð¾Ð»Ð´Ð¶Ñ - just a test string - йцукенгшщзхъ

I want to be able to convert it to

фывапролджэ - just a test string - йцукенгшщзхъ

Can somebody help with the conversion, a) with Ruby code and/or b) with SQL?

As copy-paste may lose some info, here are the bytes of the returned string: [195, 145, 226, 128, 158, 195, 145, 226, 128, 185, 195, 144, 194, 178, 195, 144, 194, 176, 195, 144, 194, 191, 195, 145, 226, 130, 172, 195, 144, 194, 190, 195, 144, 194, 187, 195, 144, 194, 180, 195, 144, 194, 182, 195, 145, 194, 141, 32, 45, 32, 106, 117, 115, 116, 32, 97, 32, 116, 101, 115, 116, 32, 115, 116, 114, 105, 110, 103, 32, 45, 32, 195, 144, 194, 185, 195, 145, 226, 128, 160, 195, 145, 198, 146, 195, 144, 194, 186, 195, 144, 194, 181, 195, 144, 194, 189, 195, 144, 194, 179, 195, 145, 203, 134, 195, 145, 226, 128, 176, 195, 144, 194, 183, 195, 145, 226, 128, 166, 195, 145, 197, 160]
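To double-check the nature of the mangling, the suspected corruption can be reproduced in Ruby: take a correctly encoded UTF-8 string, relabel its raw bytes as cp1252, and transcode back to UTF-8. This is only a sketch for diagnosis; cp1252 (Windows-1252) rather than ISO-8859-1 is assumed here because the example contains cp1252-only punctuation like „ and €:

```ruby
# Simulate the suspected corruption: the UTF-8 bytes of 'ф' (0xD1 0x84)
# get misread as cp1252 characters and re-encoded to UTF-8.
intended = 'ф'                        # UTF-8 bytes: 0xD1 0x84
misread  = intended.dup.force_encoding('Windows-1252')
mangled  = misread.encode('UTF-8')    # one character becomes two
puts mangled                          # => Ñ„
```

If the output matches the garbage seen in the DB, the data was double-encoded on the way in.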

UncleGene
  • I've had this in the past. It's easy on Ruby 1.8 to send UTF-8 bytes and have them stored incorrectly as latin1. You need to fix the issue on the MySQL side by first converting the column to a blob, and then back to a utf8 string/text column. This makes MySQL reinterpret the data – Frederick Cheung Apr 19 '13 at 17:34
  • Thanks for a hint in proper direction. Please see my answer below (could not fit content in comment) – UncleGene Apr 19 '13 at 17:59

2 Answers

1

OK, I found an SQL solution for this in How to fix double-encoded UTF8 characters (in an utf-8 table).

CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)

Any takers for Ruby?
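For reference, the same trick might look like this in Ruby (a sketch, assuming the data went UTF-8 → cp1252 → UTF-8; the helper name is made up). MySQL's "latin1" is really cp1252 with the five undefined bytes kept as C1 control characters, hence the fallback:

```ruby
# Undo UTF-8 -> cp1252 -> UTF-8 double encoding.
# cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined; MySQL's
# "latin1" maps them to C1 controls, so pass those bytes through as-is.
def fix_double_encoding(str)
  str.encode('Windows-1252',
             fallback: ->(ch) { ch.ord.chr.force_encoding('Windows-1252') })
     .force_encoding('UTF-8')
end

puts fix_double_encoding('Ñ„Ñ‹Ð²Ð°')  # => фыва
```

Unlike force_encoding alone, this actually reverses both layers: encode maps each mangled character back to its single cp1252 byte, and force_encoding then reinterprets the resulting byte string as the original UTF-8.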

UncleGene
  • If your database content is mangled, you need to de-mangle it there. No amount of fussing in Ruby will fix it. – tadman Apr 19 '13 at 18:08
  • The benefit of a Ruby conversion is that it can be applied lazily (apparently the approach is idempotent). Database conversion is tricky timing-wise: with any deployment strategy you have old clients looking at new records and/or new clients looking at old ones. – UncleGene Apr 19 '13 at 18:16
  • Disregard the idempotency comment; it related to a wrong solution – UncleGene Apr 19 '13 at 18:27
  • You can do an in-place `UPDATE` if you want to fix the problem permanently. I'd suggest doing that on a test copy of the database before going at your production data. I've had data doubly encoded as UTF-8 before due to a connection misconfiguration. Annoying to fix, but necessary. Depending on the amount of data you need to convert, this could take anywhere from minutes to hours. – tadman Apr 19 '13 at 18:34
0

You might want to set the encoding of your database connection in config/database.yml and experiment with different settings until you get the desired result.

It could be that your connection is defaulting to latin1 for some reason, with the data then being reinterpreted as UTF-8 internally.
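The connection charset for the mysql2 adapter is controlled by the encoding key; a minimal sketch of config/database.yml (the database name and credentials are placeholders):

```yaml
production:
  adapter: mysql2
  encoding: utf8          # charset used for the connection
  database: myapp_production
  username: myapp
  password: secret
  host: localhost
```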

tadman
  • I tried all possible connection encoding settings and get garbage for all of them. Also, I want the code to use UTF-8: all new records are recorded/retrieved correctly; the issue is only with legacy records – UncleGene Apr 19 '13 at 17:28