I have a few text files to read that contain strings such as:
"Yes! Sardines in a can distancing! \uD83E\uDD23"
The problem is this: when I run
"Yes! Sardines in a can distancing! \uD83E\uDD23".encode('utf-16', 'surrogatepass').decode('utf-16')
the surrogate pair is converted to the emoji, because Python treats \uD83E and \uDD23 as one character each.
Output:
Yes! Sardines in a can distancing! 🤣
Also, when I check the length of the above string with len(), the result is 37.
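To make those numbers concrete, here is a minimal sketch of what I believe happens with the string literal. It assumes Python 3; the variable names s and fixed are only for illustration.

```python
# The escapes are interpreted by the Python parser, so the literal holds
# two surrogate code points instead of twelve visible characters.
s = "Yes! Sardines in a can distancing! \uD83E\uDD23"
print(len(s))  # 35 ordinary characters + 2 surrogates = 37

# 'surrogatepass' lets the lone surrogates be written as UTF-16 code units;
# decoding then reads the pair back as the single code point U+1F923.
fixed = s.encode('utf-16', 'surrogatepass').decode('utf-16')
print(len(fixed))           # 36
print(hex(ord(fixed[-1])))  # 0x1f923
```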
However, when I read the same string from a text file, Python reads \uD83E and \uDD23 as literal text, i.e. 12 separate characters in total. I do not want that, because my encode().decode() call then doesn't give the expected result: the escape sequences are not converted to the emoji. I used the code below:
# Index 75 holds the tweet quoted above
for index, tweet in enumerate(tweet_dict):
    if index == 75:
        a = tweet['text']
        print('Length of the string is:', len(str(a)))
        print(a.encode('utf-16', 'surrogatepass').decode('utf-16'))
Output is:
Length of the string is: 47
Yes! Sardines in a can distancing! \uD83E\uDD23
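The file case can be reproduced without the file. The sketch below uses a raw string as a stand-in for the text that was read, and tries one possible conversion. The helper unescape_u is hypothetical (my own name, not a library function), and it assumes the only escapes in the text are of the form \uXXXX. If the files are actually JSON, parsing them with the json module would probably be a cleaner route, since it decodes these escapes itself.

```python
import re

# Stand-in for a line read from the text file: a raw string keeps the
# backslashes, so each escape is six ordinary characters.
raw = r"Yes! Sardines in a can distancing! \uD83E\uDD23"
print(len(raw))  # 35 + 12 = 47

def unescape_u(text):
    # Hypothetical helper: replace each literal backslash-u-XXXX sequence
    # with the code point named by its four hex digits.
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

with_surrogates = unescape_u(raw)
print(len(with_surrogates))  # 37, same as the string literal

# The encode/decode round trip from the question now merges the pair.
fixed = with_surrogates.encode('utf-16', 'surrogatepass').decode('utf-16')
print(len(fixed))  # 36
```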