
I am provided a dataset that includes tweets as strings, for example (in UTF-8):

'i love you 💖'

If we encode it using unicode-escape, we get 'i love you \xf0\x9f\x92\x96'

Obviously, Python has treated each byte of the emoji as its own character.

We know that, for example, \xf0\x9f\x92\x96 represents 💖, but I have been unable to actually convert the string to the equivalent string with the correct emojis in it, i.e. 'i love you 💖'

Related: Whatever I implement should also be able to convert 'i love you \xf0\x9f\x92\x96\xf0\x9f\x92\x96' to 'i love you 💖💖'

How would I do this in Python 3?

Edit: I am being provided the data in this format. I have no control over how this data is generated.

Edit 2: Some data from the dataset: ð hacienda heights international ðcelebration of building global ð citizens of the ð! #thedistrict

hex code: 0xB0, 0xC2, 0x9F, 0xC2, 0x8E, 0xC2, 0x89, 0x20, 0x68, 0x61, 0x63, 0x69, 0x65, 0x6E, 0x64, 0x61, 0x20, 0x68, 0x65, 0x69, 0x67, 0x68, 0x74, 0x73, 0x20, 0x69, 0x6E, 0x74, 0x65, 0x72, 0x6E, 0x61, 0x74, 0x69, 0x6F, 0x6E, 0x61, 0x6C, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8E, 0xC2, 0x8A, 0x63, 0x65, 0x6C, 0x65, 0x62, 0x72, 0x61, 0x74, 0x69, 0x6F, 0x6E, 0x20, 0x6F, 0x66, 0x20, 0x62, 0x75, 0x69, 0x6C, 0x64, 0x69, 0x6E, 0x67, 0x20, 0x67, 0x6C, 0x6F, 0x62, 0x61, 0x6C, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8C, 0xC2, 0x8F, 0x20, 0x63, 0x69, 0x74, 0x69, 0x7A, 0x65, 0x6E, 0x73, 0x20, 0x6F, 0x66, 0x20, 0x74, 0x68, 0x65, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8C, 0xC2, 0x8F, 0x21, 0x20, 0x23, 0x74, 0x68, 0x65, 0x64, 0x69, 0x73, 0x74, 0x72, 0x69, 0x63, 0x74, 0x20, 0x0A

Peter Lo
  • Is it a web application? – Tarik Sep 24 '21 at 08:03
  • Hope this will help [escaped-unicode-to-emoji-in-python](https://stackoverflow.com/questions/54559885/escaped-unicode-to-emoji-in-python) – Sangeerththan Balachandran Sep 24 '21 at 08:14
  • Does the original dataset already contain mojibake and you're required to fix this after the fact, or are you merely mistreating the dataset and are producing that mojibake for yourself? The solutions to these two situations are very different. – deceze Sep 24 '21 at 08:21
  • Please review https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Sep 24 '21 at 10:31
  • Having reviewed it I am struggling to see how that helps me. – Peter Lo Sep 24 '21 at 10:46
  • I expected the data, which comes from a csv file of a tweet dataset, to be in utf-8. It's treating each 8-bit segment of the emoji as a separate symbol. I can't seem to import it in such a way as to make it treat the entire sequence as an emoji. – Peter Lo Sep 24 '21 at 10:49
  • What is the actual encoding in the actual CSV file? Inspect it with a hex editor. It's unclear whether the text is already incorrectly saved into the CSV, or whether you're just reading the CSV incorrectly. – deceze Sep 24 '21 at 11:00
  • To be clear here: "ðŸ’–" is *not* "UTF-8". It's the **characters** ð, Ÿ, ’, and –. And it's *probably* garbage, because some **bytes** have been interpreted using the wrong encoding **at some point**. We have zero information here to know where exactly that misinterpretation happened. It may be as simple as your Python program needing to read the CSV file correctly. Or it may be as complicated as needing to fix garbage produced earlier by some other program. – deceze Sep 24 '21 at 11:25
  • utf-8 is the only encoding that remotely makes sense as in the English text comes out correctly. Everything else throws out complete garbage. One solution I can think of is to manually create a conversion function that converts each text representation of the bunch of hex codes into the correct emoji but I wonder if there is a faster way. – Peter Lo Sep 24 '21 at 13:06
  • Again, we ain't going anywhere with this unless you bust out a hex editor and give us a sample of the raw bytes that this text sequence is saved as. Then we can tell you how it's encoded and how to properly read it. – deceze Sep 24 '21 at 13:07
  • Edited the question to show an example tweet in plaintext and the underlying hex codes. – Peter Lo Sep 24 '21 at 13:18
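The byte-level inspection deceze asks for can also be sketched in Python itself rather than a hex editor. This is a minimal illustration on an in-memory sample (the sample string is an assumption mirroring the question's mojibake, not the real dataset; for the real file, read it with `open(path, 'rb')`):

```python
# Hypothetical diagnosis step: dump the raw bytes so the encoding pattern
# can be identified. 'ð\x9f\x92\x96' is UTF-8 for 💖 mis-read as latin1;
# re-encoding it as UTF-8 produces the telltale double-encoded c2/c3 pairs.
sample = 'i love you ð\x9f\x92\x96'.encode('utf-8')
print(sample.hex(' '))  # ends with: c3 b0 c2 9f c2 92 c2 96
```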

2 Answers


The hex codes indicate you have double-encoded mojibake (and are missing a leading byte... a typo?). This decodes it:

```python
# hex codes provided
s = '0xB0, 0xC2, 0x9F, 0xC2, 0x8E, 0xC2, 0x89, 0x20, 0x68, 0x61, 0x63, 0x69, 0x65, 0x6E, 0x64, 0x61, 0x20, 0x68, 0x65, 0x69, 0x67, 0x68, 0x74, 0x73, 0x20, 0x69, 0x6E, 0x74, 0x65, 0x72, 0x6E, 0x61, 0x74, 0x69, 0x6F, 0x6E, 0x61, 0x6C, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8E, 0xC2, 0x8A, 0x63, 0x65, 0x6C, 0x65, 0x62, 0x72, 0x61, 0x74, 0x69, 0x6F, 0x6E, 0x20, 0x6F, 0x66, 0x20, 0x62, 0x75, 0x69, 0x6C, 0x64, 0x69, 0x6E, 0x67, 0x20, 0x67, 0x6C, 0x6F, 0x62, 0x61, 0x6C, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8C, 0xC2, 0x8F, 0x20, 0x63, 0x69, 0x74, 0x69, 0x7A, 0x65, 0x6E, 0x73, 0x20, 0x6F, 0x66, 0x20, 0x74, 0x68, 0x65, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8C, 0xC2, 0x8F, 0x21, 0x20, 0x23, 0x74, 0x68, 0x65, 0x64, 0x69, 0x73, 0x74, 0x72, 0x69, 0x63, 0x74, 0x20, 0x0A'
# convert to bytes object
b = bytes([int(x, 0) for x in s.split(', ')])
print(b)
# This looks like UTF-8, but is missing a valid start byte.  I guessed what it was based on the data:
b = b'\xc3' + b
# This decodes, but is wrong because it was originally UTF-8 decoded incorrectly as ISO-8859-1 (aka latin1)
print(repr(b.decode('utf8')))
# Undo by encoding as latin1 and decoding correctly as UTF-8
s = b.decode('utf8').encode('latin1').decode('utf8')
print(s)
```

Output:

```
b'\xb0\xc2\x9f\xc2\x8e\xc2\x89 hacienda heights international \xc3\xb0\xc2\x9f\xc2\x8e\xc2\x8acelebration of building global \xc3\xb0\xc2\x9f\xc2\x8c\xc2\x8f citizens of the \xc3\xb0\xc2\x9f\xc2\x8c\xc2\x8f! #thedistrict \n'
'ð\x9f\x8e\x89 hacienda heights international ð\x9f\x8e\x8acelebration of building global ð\x9f\x8c\x8f citizens of the ð\x9f\x8c\x8f! #thedistrict \n'
🎉 hacienda heights international 🎊celebration of building global 🌏 citizens of the 🌏! #thedistrict 
```

So at some point the original UTF-8 data was read as ISO-8859-1 (aka latin1) and then written out as UTF-8 again. If your dataset is in a file, you could correct it with:

```python
with open('dataset.txt', encoding='utf8') as f:
    data = f.read().encode('latin1').decode('utf8')
with open('corrected.txt', 'w', encoding='utf8') as f:
    f.write(data)
```
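Since the asker's data arrives as a CSV, the same round-trip can also be applied per-field with the `csv` module. This is a rough sketch under assumptions: the column layout is invented, and an in-memory `StringIO` stands in for the real file (in practice, `open('dataset.csv', newline='', encoding='utf8')`). Note it assumes every field is consistently mojibake; plain ASCII fields pass through unchanged.

```python
import csv
import io

def fix_mojibake(text):
    # Undo UTF-8 text that was mis-read as latin1: map each character back
    # to its original byte value, then decode those bytes correctly as UTF-8.
    return text.encode('latin1').decode('utf-8')

# Simulated file contents with a hypothetical two-column layout.
raw = io.StringIO('id,tweet\r\n1,i love you \xf0\x9f\x92\x96\r\n')
for row in csv.reader(raw):
    print([fix_mojibake(field) for field in row])
```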
Mark Tolonen

You can simply decode the encoded bytes with utf-8:

```python
print(b'i love you \xf0\x9f\x92\x96'.decode('utf-8'))
```

This outputs:

```
i love you 💖
```
blhsing
  • This doesn't work: Note that the string 'i love you \xf0\x9f\x92\x96' is not of type bytes. It is of type str, and you cannot decode that. – Peter Lo Sep 24 '21 at 10:28
  • @PeterLo You missed the `b'...'` prefix. Or are you now asking how to convert a string into a `bytes` sequence? `string.encode('latin-1')` (obscurely) does that. – tripleee Sep 24 '21 at 10:32
  • The data is supplied as a CSV; I can't just add a b'...' to it. – Peter Lo Sep 24 '21 at 10:38
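The latin-1 trick from the comments can be sketched directly on the string from the question. The str holds the *characters* U+00F0, U+009F, U+0092, U+0096; latin-1 maps each character to the byte of the same numeric value, after which a normal UTF-8 decode yields the emoji:

```python
s = 'i love you \xf0\x9f\x92\x96'   # a str, not bytes
b = s.encode('latin-1')             # b'i love you \xf0\x9f\x92\x96'
print(b.decode('utf-8'))            # i love you 💖
```

This single round-trip handles the simple example; the tweet dataset shown in the question was double-encoded and additionally missing a byte, which is why the accepted answer needs extra steps.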