I have strings (English words, foreign words, and emojis) stored in a MySQL database.

The data is loaded with

charset = 'latin1'

Then I preprocess the data with

str = str.encode('latin-1').decode('utf-8')

After doing so, everything looks good except for the Unicode symbols, which show up as escapes like \u'******'.

I would appreciate any help.

com
  • Can you give an example of how such a string looks, and how it should look? And please specify how you "look" at the output (`print` to the terminal, write to file or something else). – lenz May 21 '18 at 05:25
  • Note: performing `.encode('latin-1').decode('utf-8')` isn't something you need to do normally, but it's a typical work-around to recover from erroneous encoding from a previous step. – lenz May 21 '18 at 05:27
  • @lenz, I output the strings on the web form and they look like the following: "98\ud83d\udc2f\ud83d\udc95Puipui Chan" – com May 21 '18 at 23:17
  • Ok, so this happens inside a server-side script, right? Is this CGI or WSGI (or something else)? Can you update the post with some code that shows all operations that happen to the data (fetching, de-/encoding, writing)? – lenz May 22 '18 at 05:37
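
The escapes in the sample string from the comments are UTF-16 surrogate pairs, which is how ASCII-only JSON escaping writes characters outside the Basic Multilingual Plane. The post does not show how the web form output is produced, so the following is only a sketch under the assumption that the string passes through something like `json.dumps` with its default settings and is then displayed without being parsed back:

```python
import json

# The sample from the comments, written with explicit code points:
# U+1F42F (tiger face) and U+1F495 (two hearts).
text = "98\U0001F42F\U0001F495Puipui Chan"

# With the default ensure_ascii=True, every non-ASCII character is escaped,
# and characters above U+FFFF become a pair of \uXXXX surrogate escapes.
escaped = json.dumps(text)
print(escaped)  # "98\ud83d\udc2f\ud83d\udc95Puipui Chan"

# ensure_ascii=False keeps the emoji as real characters in the output.
readable = json.dumps(text, ensure_ascii=False)

# Parsing the escaped form as JSON restores the original string.
restored = json.loads(escaped)
print(restored == text)  # True
```

If this assumption holds, the data itself is intact and only the display step needs to parse the JSON (or emit it with `ensure_ascii=False`).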

1 Answer

Don't use encode/decode; it only adds to your woes.

Your description is not clear on the path taken by the emoji. Were they correctly encoded in UTF-8, but then mangled when stored into a latin1 column in the table?

Or was it something else?

See "Best practice" in "Trouble with UTF-8 characters; what I see is not what I stored".

If the data was erroneously stored into a latin1 column, see "CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset" in http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

Rick James