0

I truly wouldn't be asking this if I hadn't tried everything, However I can't seem to decode the following MYSQL description field string.

“Let’s

I've tried to following statement in MYSQL to no avail.

update mytable set mycolumn = 
    convert(binary convert(mycolumn using latin1) using utf8);

This decodes the string to âLetâs go there,â which obviously still isn't correct.

Are there any other pointers someone could pass to solve this issue? There are more examples than just this character encoding across thousands of fields.

Thanks

Juakali92
  • 1,155
  • 8
  • 20

2 Answers2

1

It may not be a character set that MySQL recognizes. Running SHOW CHARACTER SET; returns all charactersets avalable in MySQL, so a script can be run for all of them.

The one that looks right would be the winner, I guess. If none of them look right, you may have to decode this outside of MySQL.

select convert(binary convert(mycolumn using armscii8 ) using utf8) txt, 1 union
select convert(binary convert(mycolumn using ascii ) using utf8) txt, 2 union
select convert(binary convert(mycolumn using big5 ) using utf8) txt, 3 union
select convert(binary convert(mycolumn using binary ) using utf8) txt, 4 union
select convert(binary convert(mycolumn using cp1250 ) using utf8) txt, 5 union
select convert(binary convert(mycolumn using cp1251 ) using utf8) txt, 6 union
select convert(binary convert(mycolumn using cp1256 ) using utf8) txt, 7 union
select convert(binary convert(mycolumn using cp1257 ) using utf8) txt, 8 union
select convert(binary convert(mycolumn using cp850 ) using utf8) txt, 9 union
select convert(binary convert(mycolumn using cp852 ) using utf8) txt, 10 union
select convert(binary convert(mycolumn using cp866 ) using utf8) txt, 11 union
select convert(binary convert(mycolumn using cp932 ) using utf8) txt, 12 union
select convert(binary convert(mycolumn using eucjpms ) using utf8) txt, 13 union
select convert(binary convert(mycolumn using euckr ) using utf8) txt, 14 union
select convert(binary convert(mycolumn using gb18030 ) using utf8) txt, 15 union
select convert(binary convert(mycolumn using gb2312 ) using utf8) txt, 16 union
select convert(binary convert(mycolumn using gbk ) using utf8) txt, 17 union
select convert(binary convert(mycolumn using geostd8 ) using utf8) txt, 18 union
select convert(binary convert(mycolumn using greek ) using utf8) txt, 19 union
select convert(binary convert(mycolumn using hebrew ) using utf8) txt, 20 union
select convert(binary convert(mycolumn using hp8 ) using utf8) txt, 21 union
select convert(binary convert(mycolumn using keybcs2 ) using utf8) txt, 22 union
select convert(binary convert(mycolumn using koi8r ) using utf8) txt, 23 union
select convert(binary convert(mycolumn using koi8u ) using utf8) txt, 24 union
select convert(binary convert(mycolumn using latin1 ) using utf8) txt, 25 union
select convert(binary convert(mycolumn using latin2 ) using utf8) txt, 26 union
select convert(binary convert(mycolumn using latin5 ) using utf8) txt, 27 union
select convert(binary convert(mycolumn using latin7 ) using utf8) txt, 28 union
select convert(binary convert(mycolumn using macce ) using utf8) txt, 29 union
select convert(binary convert(mycolumn using macroman ) using utf8) txt, 30 union
select convert(binary convert(mycolumn using sjis ) using utf8) txt, 31 union
select convert(binary convert(mycolumn using swe7 ) using utf8) txt, 32 union
select convert(binary convert(mycolumn using tis620 ) using utf8) txt, 33 union
select convert(binary convert(mycolumn using ucs2 ) using utf8) txt, 34 union
select convert(binary convert(mycolumn using ujis ) using utf8) txt, 35 union
select convert(binary convert(mycolumn using utf16 ) using utf8) txt, 36 union
select convert(binary convert(mycolumn using utf16le ) using utf8) txt, 37 union
select convert(binary convert(mycolumn using utf32 ) using utf8) txt, 38 union
select convert(binary convert(mycolumn using utf8 ) using utf8) txt, 39 union
select convert(binary convert(mycolumn using utf8mb4 ) using utf8) txt, 40 union
 
Lars Skaug
  • 1,376
  • 1
  • 7
  • 13
0

It comes close to be "double-encoding" of

‘Let‛s

Here are some potentially related characters:

E28098     8216=x2018  [‘]   ON  LEFT SINGLE QUOTATION MARK
E2809A     8218=x201A  [‚]   ON  SINGLE LOW-9 QUOTATION MARK
E2809B     8219=x201B  [‛]   ON  SINGLE HIGH-REVERSED-9 QUOTATION MARK
E2809C     8220=x201C  [“]   ON  LEFT DOUBLE QUOTATION MARK

Please provide SELECT HEX(col), I think there might be a byte that got lost in pasting into this web page. And I think that the â comes from a decoding attempt that failed because of truncation due the that missing byte.

"Double encoding" happens when the text has been incorrectly labeled as latin1, then stored into a utf8 column in the table. Then that is repeated.

More: Trouble with UTF-8 characters; what I see is not what I stored

More on quotes: http://mysql.rjweb.org/doc.php/charcoll#quotes

More on Mojibake: http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings (see second table)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 20C383C2A2C382E282ACC382C5934C6574C383C2A2C382E282ACC382E284A273 This is the hex return of the requested string. Thanks for the information as well. – Juakali92 Aug 09 '20 at 17:02
  • I'm stumped. Can you explain the history of the string -- how it was stored, copied, etc; what the settings were at each step; etc. – Rick James Aug 09 '20 at 23:03
  • This is a small segment from a product's description. This data is initially pulled down via a CSV file provided by one of our suppliers. This is then parsed and filtered into a sim product database. Once a day, products on each of our clients sites are updated via a CRON job from the e-com to product sim DB. After some investigation this evening. This is the raw copy of the string provided directly from our suppliers CSV. “Let’s – Juakali92 Aug 09 '20 at 23:54
  • @Juakali92 - Is the supplier likely to be using some non-English language? There are dozens of character sets. I have not gotten anywhere by assuming latin1 was part of the process. If, instead, some other encoding was used, I might have more success. – Rick James Aug 10 '20 at 02:56
  • They are an American Based company, frustrating a well-known brand which makes it difficult to communicate through the right channels. We have tried asking before to no avail which charset is used. Seems as if we'll need to try and ask again – Juakali92 Aug 10 '20 at 07:19
  • @Juakali92 - I'm baffled. I have a lot of techniques; all have failed on your text. Do you know what the text _should be_ saying? – Rick James Aug 10 '20 at 22:32
  • “Let's is the exact string in which it should be. I've tried a few more myself today to no avail – Juakali92 Aug 10 '20 at 23:25
  • @Juakali92 - That starts with a different quote (double vs single). `E2809C` --> `C3A2 E282AC C593`; this does not quite match what you had. – Rick James Aug 11 '20 at 04:18
  • @Juakali92 - If this is the only case you have, might it be worth manually fixing? – Rick James Aug 11 '20 at 04:20