
I have a MySQL database with German umlauts ("öäü") in a table, and I need to write a Go app that reads the table, encodes the text to ISO-8859-1, and writes it to a text file.

So far, so good, but the encoding to ISO-8859-1 is not working, and I have tried to debug it.

Here are some details and information:

MySQL: The database is UTF-8, and so is the table itself. The other character sets should be fine as well, except character_set_server, but I think that is not relevant here; as far as I know, it is just a default for new databases.

When I query the database with the following SQL, I get the correctly UTF-8-encoded text:

select street, hex(street) from test_table where id = '36'

Result (the actual value is Fröbelstraße):
Fröbelstraße, 4672C3B662656C73747261C39F65

So the hex string is exactly what I expected. OK.

Go app (just the relevant parts):

db, err := sql.Open("mysql", "...<connection string>...")
res, err := db.Query("select street from test_table where id = '36'")

for res.Next() {
    var pb Phonebook
    err := res.Scan(&pb.Street)
    fmt.Println(hex.EncodeToString([]byte(pb.Street)))
}

The output is: 4672c383c2b662656c73747261c383c5b865

And that is why my encoding to ISO-8859-1 is not working: the string coming from the database is not correct. The hex from the direct database query is correct and also works with the encoding.

But I don't understand why I get a different string from the Go client.

The original string "Fröbelstraße" contains two special characters: "ö", which is C3B6, and "ß", which is C39F. The hex from the query with a database client is OK, but the one from the Go app is too long; I get two extra bytes per special character.

When I feed my Latin-1 converter with the correct hex string, it works fine and I get an ISO-8859-1 string — but not with the one I query directly from Go.

I do this with:

    d := charmap.ISO8859_1.NewEncoder()
    out, err := d.String(inp)

This is also just a snippet; I actually call a function with a string, but I never get the correct ISO-8859-1 result. So I tried it with the hex from the MySQL client query, converted back to a string, and bingo, the ISO-8859-1 result is correct.

I also tried querying from Python and got the same strange hex there, so I am completely lost as to what is wrong. It cannot be Go, because it is the same in Python. But in my opinion the data is stored correctly in the database, and the MySQL character sets are all set to utf8mb4 or utf8, except the one I mentioned above.

halfer
SmithyAT
  • sounds to me like your SQL client character set is ISO-8859-1, so your results are already in ISO-8859-1. Doesn't the MySQL server character set also define the default per-connection character set? – erik258 Jan 22 '22 at 22:24
  • check the table/column character set and collation, connect with the same character set and collation, and then convert to whatever you like. – nbk Jan 22 '22 at 22:57
  • @nbk you are absolutely right, that was the problem, but I hadn't checked the comments here earlier. I was not aware that each column also has its own character set. Thanks! – SmithyAT Jan 22 '22 at 23:00

2 Answers


4672c383c2b662656c73747261c383c5b865 appears to be "double-encoded". Look for that case in Trouble with UTF-8 characters; what I see is not what I stored

However, since you were getting the hex from the app, not from the table, it is inconclusive. Please do SELECT HEX(col) FROM ... to see if you get exactly that string. Often apps, especially browsers, try to "fix" the problem, thereby making it harder to diagnose correctly.

If you have "CHARACTER SET utf8mb4 with double-encoding", then this may cure the data:

UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);

However, you need to fix the source of the data, too.

Rick James
  • Many thanks Rick, your linked article helped a lot to understand the underlying problem and how to fix it. I have learned over the last few days, thanks to your article, that it is very important to have consistent settings across the whole environment. A single setting pointing to the wrong character set can lead to very strange and unexpected problems that are hard to debug and understand! – SmithyAT Jan 23 '22 at 11:42
  • @SmithyAT - Yes. Nicely put. (It took me years to discover how many different ways there were to mess things up, _and_ figure out what to do about each.) – Rick James Jan 23 '22 at 17:40

After two days of being completely lost, I found the root cause myself. Strange that it happened shortly after I posted the question here.

I wanted to try a different MySQL server and therefore dumped the table. In the dump I saw that each field has its own character set definition, which was latin1 in my case.

So that explains the strange result. I created a test table with the correct encoding, and it now works as expected.

Now I have to think about how I can "repair" these encodings; maybe a dump/restore will do it, but that is another story.

SmithyAT