
After a decode, I expect each 4-byte UTF-8 sequence to be replaced by a single Unicode escape. For instance, \xf0\x9f\x98\x8e should become \U0001F60E. Why doesn't decode combine the 4-byte sequences? More specifically, if I want to search for a specific emoji, I'd like to use the escape form.
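A note on the escape itself, as a minimal sketch: in a Python string literal `\u` takes exactly four hex digits, so a literal like `\u1F60E` parses as U+1F60 followed by the letter `E`. Code points above U+FFFF need the eight-digit `\U` form. The variable names and the sample text below are illustrative only.

```python
short_form = u'\u1F60E'      # U+1F60 followed by the letter 'E'
long_form = u'\U0001F60E'    # the single code point U+1F60E

print(short_form == u'\u1F60' + u'E')                    # True
print(long_form.encode('utf-8') == b'\xf0\x9f\x98\x8e')  # True

# Searching a correctly decoded unicode string works with the \U form.
text = u'so cool \U0001F60E!'   # hypothetical sample text
print(long_form in text)        # True
```

The search only succeeds if the text really contains the code point U+1F60E, which is not yet the case for the decoded row shown in this question.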

row3 = tweet_table.loc[3, 'tweet']
row3
'#model   i love u take with u all the time in ur\xc3\xb0\xc2\x9f\xc2\x93\xc2\xb1!!! \xc3\xb0\xc2\x9f\xc2\x98\xc2\x99\xc3\xb0\xc2\x9f\xc2\x98\xc2\x8e\xc3\xb0\xc2\x9f\xc2\x91\xc2\x84\xc3\xb0\xc2\x9f\xc2\x91\xc2\x85\xc3\xb0\xc2\x9f\xc2\x92\xc2\xa6\xc3\xb0\xc2\x9f\xc2\x92\xc2\xa6\xc3\xb0\xc2\x9f\xc2\x92\xc2\xa6'

print(row3)
#model   i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦

len(row3)
116

type(row3)
str

row3_u = row3.decode('utf-8',errors="replace")
row3_u
u'#model   i love u take with u all the time in ur\xf0\x9f\x93\xb1!!! \xf0\x9f\x98\x99\xf0\x9f\x98\x8e\xf0\x9f\x91\x84\xf0\x9f\x91\x85\xf0\x9f\x92\xa6\xf0\x9f\x92\xa6\xf0\x9f\x92\xa6'

len(row3_u)
84

print(row3_u)
#model   i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
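The `\xc3\xb0\xc2\x9f...` pattern in `row3` suggests the text was encoded twice: the original UTF-8 bytes were read as Latin-1 and then encoded to UTF-8 again. If that assumption holds, a single `decode('utf-8')` only undoes the outer layer, which would explain why each emoji still shows up as four code points. A minimal sketch of undoing both layers, using one emoji copied from the row (variable names are illustrative):

```python
raw = b'\xc3\xb0\xc2\x9f\xc2\x98\xc2\x8e'   # one emoji as it appears in row3

# First decode: yields u'\xf0\x9f\x98\x8e', still four separate code points.
outer = raw.decode('utf-8')

# Undo the assumed Latin-1 misreading, then decode the real UTF-8 bytes.
fixed = outer.encode('latin-1').decode('utf-8')

print(len(outer))                 # 4
print(fixed == u'\U0001F60E')     # True
```

Applied to the whole row, this would be `row3.decode('utf-8').encode('latin-1').decode('utf-8')`. It relies on every decoded code point being below U+0100, which is true for the row shown here but is not guaranteed for the rest of the table.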
fickas
  • You should drop Python 2 and use Python 3 instead as the latter has built-in Unicode support – ForceBru Mar 20 '18 at 17:56
  • @ForceBru: In spite of that reality there are still scripts that have to use Python 2 for the time being. – Makoto Mar 20 '18 at 17:58
  • I don't know if you already did, but I think you should check [this answer](https://stackoverflow.com/questions/10798605/warning-raised-by-inserting-4-byte-unicode-to-mysql) – Anwarvic Mar 20 '18 at 19:25

0 Answers