
Recently we moved our MySQL database from Latin1 to UTF-8. After trying a few different approaches to convert it, we weren't able to find any that didn't also introduce some pretty nasty data loss (and many simply did nothing).

This left me wondering if we have lots of different encodings going on, since there doesn't seem to be a single approach that covers our test cases (various posts in our database). To test this theory I wrote a small Scala app (my first, so feel free to make fun of just how cobbled-together and non-idiomatic it is! :D) that uses juniversalchardet to look at the posts and tell me the encoding.

Only one problem: everything always comes back as UTF-8.

Here's the code:

package main.scala

import org.mozilla.universalchardet.UniversalDetector
import java.sql.DriverManager

object DBConvert {
  def main(args: Array[String]) {
    val detector = new UniversalDetector(null)
    val db_conn_str = "jdbc:mysql://localhost:3306/mt_pre?user=root"
    val connection = DriverManager.getConnection(db_conn_str)

    try {
      val statement = connection.createStatement()
      val rs = statement.executeQuery("SELECT * FROM mt_entry WHERE entry_id = 3886")
      while (rs.next) {
        val buffer = rs.getBytes("entry_text_more")
        detector.handleData(buffer, 0, buffer.length)
        detector.dataEnd()

        val encoding: String = detector.getDetectedCharset

        if (encoding != null) println("Detected encoding = " + encoding)
        else println("No encoding detected.")

        detector.reset()

        // Just so we can see the output
        println(rs.getString("entry_text_more"))
      }
    } catch {
      case e: Exception => println(e.getMessage)
    } finally {
      connection.close()
    }
  }
}

I tried passing useUnicode in the JDBC connection string, and characterEncoding as well. Neither of them budged the UTF-8 that always comes out. I also tried using getBinaryStream and others; still UTF-8.
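For reference, the connection-string variant I tried looked roughly like this (parameter names as documented for Connector/J; it made no difference to the detection result):

```scala
object ConnStr {
  // Connection string with explicit charset hints for Connector/J.
  // The host, database name, and user are from the original code above.
  val db_conn_str: String =
    "jdbc:mysql://localhost:3306/mt_pre?user=root" +
      "&useUnicode=true&characterEncoding=UTF-8"
}
```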

I fully admit that character encoding makes my head bend a bit, and playing with a new language may not be the best way of fixing this problem. :) That said, I'm curious: is there a way to grab the data from the DB and detect what encoding it was put in there as? Or is it one of those things where, since it's stored as UTF-8 in the DB, no matter how you retrieve it that's just what it is (funny characters and all)?
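(For anyone poking at the same thing: here's a minimal sketch, assuming the only two candidates are UTF-8 and Latin1, that at least distinguishes "valid UTF-8" from "not" without a detector library. Since every byte sequence is legal Latin1, a detector can never rule Latin1 out, which may be part of why everything reads as UTF-8.)

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CharsetDecoder, CodingErrorAction}

object EncodingGuess {
  // Strict UTF-8 decoder: throws on malformed input instead of
  // silently substituting replacement characters.
  private def strictUtf8: CharsetDecoder =
    Charset.forName("UTF-8").newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)

  // Returns "UTF-8" if the bytes decode cleanly as UTF-8, else "ISO-8859-1".
  // This is a heuristic, not true detection: pure-ASCII text is valid in
  // both, and genuinely-Latin1 bytes can occasionally form valid UTF-8.
  def guess(bytes: Array[Byte]): String =
    try {
      strictUtf8.decode(ByteBuffer.wrap(bytes))
      "UTF-8"
    } catch {
      case _: CharacterCodingException => "ISO-8859-1"
    }
}
```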

Thanks!

bnferguson
  • Depending on what the [`character_set_connection`](http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_character_set_connection) and [`character_set_database`](http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_character_set_database) settings are, MySQL will try to perform a conversion. If your data has broken symbols, I think you need to know the source encoding in advance to deal with that. Or try converting your field to `BLOB`? – dma_k Sep 26 '11 at 23:40
  • Yeah that's the tricky part right now. Some of the data in there is UTF8, some is Latin1, and God knows what else is in there (this is a rather old db that has had a few content systems attached to it). So when I tried the blob technique it converted some wonderfully and truncated others. :( Thus I'm trying to detect first then convert. Seems whatever the agreed upon charset is between JDBC and MySQL is what everything will be in (which makes sense, just doesn't help me. ha) – bnferguson Sep 26 '11 at 23:58
  • Try to `cast(your_field as binary)`. MySQL connector should return `byte[]`. – dma_k Sep 27 '11 at 07:18
  • That certainly changed a few things in the display, but nothing changed compared with using getBytes. That said, I did more digging and switched character-detection libraries, and it seems some things are other encodings, but the corrupted ones are always UTF-8 (despite showing up correctly when names are set to latin1). Perhaps because the high-ASCII characters are also valid UTF-8 characters? At least now I know it's not the DB doing it. – bnferguson Sep 27 '11 at 21:20
  • I don't know about Scala, but please take a look at http://stackoverflow.com/q/6824390/1290442 where I solved a similar problem in Java just by changing String text = fileData.toString(); to String text = new String(fileData.toString().getBytes(), "KOI8_R");. – Zecas May 23 '12 at 09:41
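Building on Zecas's comment above, the same byte-level round trip can repair text that was stored as UTF-8 bytes but decoded as Latin1. This is a sketch assuming that specific failure mode; it will mangle text that was genuinely Latin1, so detect before you fix:

```scala
object MojibakeFix {
  // If UTF-8 bytes were mistakenly decoded as Latin1, each original byte
  // became exactly one char, so re-encoding the string as Latin1 recovers
  // the raw bytes, which can then be decoded as UTF-8 properly.
  def fixUtf8ReadAsLatin1(garbled: String): String =
    new String(garbled.getBytes("ISO-8859-1"), "UTF-8")
}
```

For example, "é" stored as UTF-8 but read as Latin1 shows up as "Ã©", and the round trip restores it.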

2 Answers


I once had a similar problem; see this answer. Setting the encoding inside the connection string may help.


Check that the table charset, the connection charset, and the database default encoding are all the same UTF-8. I had one instance in which the database default was UTF-8 but the table columns were still Latin1, which caused problems. Please see if that's the case.

Vij P