I have been given a dataset of tweets stored as strings. For example, this tweet, shown here as it should look in UTF-8:
'i love you 💖'
If we encode it using unicode-escape, we get 'i love you \xf0\x9f\x92\x96'
It appears that each byte of the emoji's UTF-8 encoding has been treated as its own character.
We know that, for example, \xf0\x9f\x92\x96 represents 💖, but I have been unable to convert the string to the equivalent string with the correct emoji in it, i.e. 'i love you 💖'
Related: whatever I implement should also be able to convert 'i love you \xf0\x9f\x92\x96\xf0\x9f\x92\x96' to 'i love you 💖💖'
How would I do this in Python 3?
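For reference, here is a minimal sketch of the conversion I am after. It assumes the strings are UTF-8 bytes that were mis-decoded as Latin-1, so every character has a code point below 256. The helper name fix_mojibake is hypothetical, not part of any library.

```python
def fix_mojibake(text: str) -> str:
    """Re-interpret a str whose characters are really UTF-8 byte values.

    Assumption: the text was UTF-8 that got decoded as Latin-1.
    If that does not hold, the input is returned unchanged.
    """
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes 0-255,
        # so this recovers the original byte sequence.
        raw = text.encode("latin-1")
        # Decode those bytes as the UTF-8 they were meant to be.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


fixed = fix_mojibake("i love you \xf0\x9f\x92\x96")
double = fix_mojibake("i love you \xf0\x9f\x92\x96\xf0\x9f\x92\x96")
print(fixed)
print(double)
```

If the data had instead been mis-decoded as Windows-1252, the Latin-1 step would probably fail on some characters, and that codec would need to be tried instead.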
Edit: I am being provided the data in this format. I have no control over how this data is generated.
edit2: Some data from the dataset: ð hacienda heights international ðcelebration of building global ð citizens of the ð! #thedistrict
hex code: 0xB0, 0xC2, 0x9F, 0xC2, 0x8E, 0xC2, 0x89, 0x20, 0x68, 0x61, 0x63, 0x69, 0x65, 0x6E, 0x64, 0x61, 0x20, 0x68, 0x65, 0x69, 0x67, 0x68, 0x74, 0x73, 0x20, 0x69, 0x6E, 0x74, 0x65, 0x72, 0x6E, 0x61, 0x74, 0x69, 0x6F, 0x6E, 0x61, 0x6C, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8E, 0xC2, 0x8A, 0x63, 0x65, 0x6C, 0x65, 0x62, 0x72, 0x61, 0x74, 0x69, 0x6F, 0x6E, 0x20, 0x6F, 0x66, 0x20, 0x62, 0x75, 0x69, 0x6C, 0x64, 0x69, 0x6E, 0x67, 0x20, 0x67, 0x6C, 0x6F, 0x62, 0x61, 0x6C, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8C, 0xC2, 0x8F, 0x20, 0x63, 0x69, 0x74, 0x69, 0x7A, 0x65, 0x6E, 0x73, 0x20, 0x6F, 0x66, 0x20, 0x74, 0x68, 0x65, 0x20, 0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8C, 0xC2, 0x8F, 0x21, 0x20, 0x23, 0x74, 0x68, 0x65, 0x64, 0x69, 0x73, 0x74, 0x72, 0x69, 0x63, 0x74, 0x20, 0x0A
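The byte pattern in the dump (0xC3 0xB0 0xC2 0x9F ...) looks like UTF-8 that has been encoded twice. Below is a small sketch that checks this on one fragment copied from the dump, the bytes for the emoji followed by "celebration". I use a fragment because the dump as pasted starts at 0xB0, which is not a valid first byte in UTF-8. The variable names are my own.

```python
# Fragment copied from the hex dump above.
raw = bytes([
    0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x8E, 0xC2, 0x8A,  # double-encoded emoji
    0x63, 0x65, 0x6C, 0x65, 0x62, 0x72,              # "celebr"
    0x61, 0x74, 0x69, 0x6F, 0x6E,                    # "ation"
])

# Step 1: decode the file bytes as UTF-8. This yields the garbled string
# (four characters with code points 0xF0, 0x9F, 0x8E, 0x8A, then the word).
as_read = raw.decode("utf-8")

# Step 2: turn those characters back into single bytes via Latin-1,
# then decode the result as UTF-8 to get the intended emoji.
repaired = as_read.encode("latin-1").decode("utf-8")

print(repaired)
```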