0

I have a db with content like

  https://pastebin.com/LRgTqHjN

(stored with the wrong encoding)

Save it to a file record.sql

    mysql -u root --password=root enctest < record.sql

get the same garbage as input;

   mysqldump  -u root --password=root enctest 

but if you add;

   mysqldump  -u root --password=root --default-character-set=latin1 enctest

it shows the correct content.

I want to reproduce the latter behaviour in C#, I use the following code:

  "MySqlConnection": "server=localhost;port=3306;database=enctest;user=root;password=root",

and

            MySqlCommand cmd = new MySqlCommand($"select * from siteTxt", conn);
            using (MySqlDataReader reader = (MySqlDataReader)(await cmd.ExecuteReaderAsync()))
            {
                     for (int i=0;i<6;i++)
                     {
                           
                                    row.Add(reader.GetString(i));
                     }
           }

But that shows the same garbage as:

   mysqldump  -u root --password=root enctest 

How can I make it show the same output as the second mysqldump?

Note: the real table is a few terabytes in size; I cannot just dump it and reimport it (which would fix the issue) which is why I need to fix it in the C# reader.

Update: for viewing, a solution that works is to change the select query to:

CONVERT(BINARY(CONVERT({column} USING latin1)) USING utf8mb4)

per string type column which obviously is less than ideal, but it works well.

CharlesS
  • 1,563
  • 2
  • 18
  • 31
  • 1
    So look for other misconfigurations. Because those UTF8 bytes remain UTF8 bytes. And that has **nothing to do with any problem in C#. Post your code and an actual reproduction of the problem**. Something that actually produces mangled text – Panagiotis Kanavos Nov 25 '20 at 08:16
  • With all databases I have that properly stored UTF8 as UTF8, it works just fine in my console without anything added. It's just this db that behaves differently. phpmyadmin shows the same issue with this db. – CharlesS Nov 25 '20 at 08:18
  • `msg` is already a string. It doesn't need any encoding conversion. – Panagiotis Kanavos Nov 25 '20 at 08:18
  • This is the code that produces the problem; it's a straight copy/paste. – CharlesS Nov 25 '20 at 08:19
  • YES BECAUSE YOU EXPLICITLY MANGLE THE TEXT BY CONVERTING IT TO LATIN1. The second snippet does nothing but mangle the text. What does `msg` contain before it gets mangled? – Panagiotis Kanavos Nov 25 '20 at 08:20
  • `byte[] utfBytes = iso.GetBytes(msg);` no, those aren't "UTF" bytes, they are mangled data. [Read the table of Latin1 characters](https://en.wikipedia.org/wiki/ISO/IEC_8859-1). No Greek, no Chinese there. `iso.GetBytes(msg)` is going to replace *all* characters outside that range with `?`. – Panagiotis Kanavos Nov 25 '20 at 08:23
  • I'm saying I played around with that; I removed it again because it does not work. So it's not using that convert code; it's just straight getting from the db; here a dump of a record from the db; https://pastebin.com/LRgTqHjN – CharlesS Nov 25 '20 at 08:24
  • That's already mangled. The dump you posted contains `ΣÏνδεση` instead of the proper contents. – Panagiotis Kanavos Nov 25 '20 at 08:25
  • Thatt's what's in the db, but if you import it into mysql and you use the mysql cli with --default-character-set=latin1 it works fine, so it's possible to get it out properly. – CharlesS Nov 25 '20 at 08:25
  • I added the steps to reproduce the issue. – CharlesS Nov 25 '20 at 08:28
  • 1
    Not exactly, as this has nothing to do with .NET. You have to post the steps that would end up storing mangled text in a MySQL database - create the table with the same encodings, insert text so it gets mangled the same way. This means you should include the MySQL dumps in the question, not the C# code. You have to make `MySqlConnection` to do the conversion. If you try to do that afterwards you'd be trying to reverse *two* conversions*. Perhaps you can add `;charset=utf8` or `;charset=latin1` in your connection string? `utf8mb4` is also available – Panagiotis Kanavos Nov 25 '20 at 08:43
  • I posted the connection string and the dump as well; you can totally reproduce this locally (I understand you have something better to do :). The simplicity of this is that the data *is* wrong in the database and I inherited it so I have no clue why / how it is wrong, but mysql-client/dump can get it out correctly, so C# should be able to do that too. I edited the question to be short and allow to reproduce exactly. – CharlesS Nov 25 '20 at 08:45
  • Show us the garbage. See https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored for typical cases. – Rick James Nov 29 '20 at 20:35
  • Show us the hex of what is in the table -- `SELECT HEX(col) ...` – Rick James Nov 29 '20 at 20:36
  • `ΣÏνδεσÎ` is "mojibake", possibly for Greek. I see `εσ` at the end. – Rick James Nov 29 '20 at 20:39
  • Do not use latin1 -- Greek cannot be represented in latin1. – Rick James Nov 29 '20 at 20:40
  • `CHARACTER SET utf8` (or `utf8mb4`) can handle Greek. A few other charsets can handle Greek, but you should go for utf8mb4. – Rick James Nov 29 '20 at 20:42
  • The request for HEX will determine whether it was mangled _going into_ the database or _coming out_. – Rick James Nov 29 '20 at 20:43
  • The dump uses latin1 -- bad. The connection does not specify a charset -- bad. – Rick James Nov 29 '20 at 20:44
  • `CONVERT(BINARY(CONVERT('εσ' USING latin1)) USING utf8mb4)` gives 'εσ', but you should _not_ have to use that in production code. – Rick James Nov 29 '20 at 20:46

0 Answers0