I am working on a sentiment analysis project, and the first step is to clean the text data. Some of the text contains Chinese and Tagalog, which I am trying to translate into English. However, every Chinese character in this data file appears as a Unicode code point placeholder such as:
<U+5C16>
which Python's usual encode/decode calls cannot handle directly. So I want to transform this kind of pattern into:
\u5c16
Then I think we could use the following code to get the Chinese characters I want:
text.encode('latin-1').decode('unicode_escape')
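To make that decoding step concrete, here is a minimal sketch of what I expect it to do once the text is already in the \uXXXX form. The variable names are only illustrative. Note that encode('latin-1') raises UnicodeEncodeError if the string contains any character outside Latin-1, so this assumes the escaped text is plain ASCII.

```python
# The escaped form is six ASCII characters: a backslash, 'u', and four hex digits.
escaped = r"\u5c16"

# latin-1 turns each of those characters into one byte; unicode_escape then
# interprets the \uXXXX sequence as a single code point.
decoded = escaped.encode("latin-1").decode("unicode_escape")

print(len(escaped), len(decoded))
print(decoded)
```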
So my question is: how can I use a regex to transform <U+5C16> into \u5c16?
Thank you very much!
Update: I think the hardest part is that the 5c16 in \u5c16 has to be the lowercase form of the 5C16 in <U+5C16>. In my social media dataset, what I see most often is text like the following:
<U+5C16><U+6C99><U+5480><U+9418><U+6A13>
If I could transform the above text into '\u5c16\u6c99\u5480\u9418\u6a13' and print it in Python, I would get what I actually want:
尖沙咀鐘樓
But how could I do this? Any insights and hints would be appreciated!
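For reference, here is a rough sketch of the transformation I am after, not a definitive solution. The names PLACEHOLDER and decode_placeholders are hypothetical, and the pattern assumes every placeholder is <U+ followed by 4 to 6 hex digits and >. The first part produces the lowercase \uXXXX text with re.sub; the second skips the escape step and converts each code point with chr(int(..., 16)), which should also leave any other text in the string untouched.

```python
import re

sample = "<U+5C16><U+6C99><U+5480><U+9418><U+6A13>"

# Step as asked: rewrite <U+5C16> to \u5c16, lowercasing the hex digits.
# Only 4-digit placeholders are handled here, since \u takes exactly 4 digits.
escaped = re.sub(
    r"<U\+([0-9A-Fa-f]{4})>",
    lambda m: "\\u" + m.group(1).lower(),
    sample,
)
print(escaped)

# Alternative: convert each placeholder straight to its character.
# Assumption: placeholders carry 4 to 6 hex digits.
PLACEHOLDER = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")

def decode_placeholders(text):
    """Replace every <U+XXXX> placeholder with the character it names."""
    return PLACEHOLDER.sub(lambda m: chr(int(m.group(1), 16)), text)

print(decode_placeholders(sample))
print(decode_placeholders("at <U+5C16><U+6C99><U+5480> today"))
```

As far as I can tell, the lowercasing is cosmetic: int(..., 16) and the unicode_escape codec both accept uppercase hex digits.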