
I am working on a sentiment analysis project, and the first step is to clean the text data. Some of the text is in Chinese or Tagalog, and I am trying to translate it to English. However, every Chinese character in this data file appears as a Unicode code point placeholder like:

<U+5C16>

which cannot be handled directly with Python's encode/decode methods. So I want to transform this kind of pattern into:

\u5c16

Then I think we could use the following code to get the Chinese characters I want:

text.encode('latin-1').decode('unicode_escape')

So the question now is how to use a regex to transform <U+5C16> into \u5c16?

Thank you very much!


Update: I think the most difficult thing here is that I need to let the 5c16 part in \u5c16 be equivalent to the lowercase of the 5C16 in <U+5C16>. And in my social media dataset, what I see most is the text data like the following:

<U+5C16><U+6C99><U+5480><U+9418><U+6A13>

If I could transform the above text to '\u5c16\u6c99\u5480\u9418\u6a13' and print it in Python, I could get what I really want:

尖沙咀鐘樓

But how could I do this? Any insights and hints would be appreciated!

Bright Chang
  • I don't quite get what you mean by "Note the c in \u5c16 could also be u". 5c16 is a hexadecimal number and there can't be a u in it. – montonero Jan 09 '19 at 09:15
  • @montonero Oh, I have edited my question. Thank you! – Bright Chang Jan 09 '19 at 10:06
  • Are you _sure_ your file is Ascii with codes like this, or could this be how your editor or pager is showing real unicode characters? – alexis Jan 09 '19 at 13:08

2 Answers

1

The required regex is something like this:

find: r'<U\+([A-Fa-f0-9]+?)>'

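replace with: r'\\u\1' (the backslash is doubled because Python 3.7+ rejects an unknown escape such as \u in a replacement template with "bad escape")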

To turn the resulting string into actual Unicode characters, call s.encode().decode('unicode-escape') on it.

Example:

re.sub(r'<U\+([A-Fa-f0-9]+?)>',r'\\u\1',s).encode().decode('unicode-escape')
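As an alternative to building escape sequences and decoding them, the conversion can be sketched with a replacement function that calls chr() on the parsed hex value. This is a swapped-in technique, not the approach above, and the names `CODEPOINT_TAG` and `decode_codepoint_tags` are hypothetical. It assumes the tags carry 4 to 6 hex digits, and it leaves any non-ASCII text already in the string untouched because nothing is round-tripped through bytes.

```python
import re

# Hypothetical helper names; the pattern assumes 4 to 6 hex digits per tag,
# which is enough for every code point from U+0000 to U+10FFFF.
CODEPOINT_TAG = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def decode_codepoint_tags(text):
    """Replace every <U+XXXX> tag in text with the character it names."""
    def to_char(match):
        value = int(match.group(1), 16)  # int() accepts upper- or lowercase hex
        # Leave the tag as-is if the value lies outside the Unicode range.
        return chr(value) if value <= 0x10FFFF else match.group(0)

    return CODEPOINT_TAG.sub(to_char, text)


print(decode_codepoint_tags("<U+5C16><U+6C99><U+5480><U+9418><U+6A13>"))
# prints 尖沙咀鐘樓
```

Because the hex digits are parsed with int(), there is no need to lowercase them first.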
montonero
  • Thank you for your reply! I tried your answer but it does not seem to work – Bright Chang Jan 09 '19 at 10:14
  • Currently, I could transform the '' into '\\u5c16\\u6c99\\u5480\\u9418\\u6a13' by using the following code: ```Python string = '' result1 = re.sub('\', '', result1) result2 ``` But how to get \u5c16\u6c99\u5480\u9418\u6a13 – Bright Chang Jan 09 '19 at 11:48
  • I've updated my answer with an example how to turn replaced string to unicode characters. – montonero Jan 09 '19 at 12:23
  • The quantifier `+?` will always match exactly once :-) – alexis Jan 09 '19 at 13:10
  • @alexis It would. If it wasn't followed by any other character. – montonero Jan 09 '19 at 14:00
  • oh ok, I see. You were right, I missed that the `>` is in the regex. (But what's the point of the `?` anyway? There's only one way to match.) – alexis Jan 09 '19 at 21:23
  • @alexis Agree, ? is an overkill and is more a safe measure than a necessity in this case. But it does not hurt. – montonero Jan 10 '19 at 07:19
1

If your file is exactly as you describe, here's how to convert it:

import re

text = "text with <U+5C16> and so on"
# The backslash is doubled: Python 3.7+ rejects r"\u\1" as a bad escape in a replacement.
go = re.sub(r"<U\+([0-9a-fA-F]{4})>", r"\\u\1", text)    # BMP: 4 hex digits
go = re.sub(r"<U\+([0-9a-fA-F]{5})>", r"\\U000\1", go)   # SMP: 5 -> 8 hex digits
print(go.encode("ascii").decode('unicode_escape'))

(The line marked "SMP" is only needed if you have characters outside the "basic multilingual plane".)

Output: text with 尖 and so on
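To illustrate why the 5-digit case is padded to the 8-digit \U form, here is a small sketch. The code point U+1F600 is only an illustrative assumption and does not come from the question's data. With \u, the decoder reads exactly four hex digits and treats the fifth as an ordinary character.

```python
# U+1F600 is an assumed example code point outside the BMP.
short_form = r"\u1F600".encode("ascii").decode("unicode_escape")
long_form = r"\U0001F600".encode("ascii").decode("unicode_escape")

# \u consumes four hex digits, so short_form is U+1F60 followed by a literal "0".
print(len(short_form), short_form == "\u1f60" + "0")    # prints: 2 True

# \U consumes eight hex digits, so long_form is the single character U+1F600.
print(len(long_form), long_form == "\U0001F600")        # prints: 1 True
```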

alexis
  • Unfortunately {4} is not a valid quantifier since Unicode character codes aren't bound to 16 bits. – montonero Jan 09 '19 at 14:04
  • If they are wider, you can't trivially convert them to `\uNNNN` form. If the OP wants to support the SMP as well, a second regex can convert 5-digit sequences to `r"\U000\1"`, (Anyway, `{4}` is a perfectly _valid_ quantifier; it just won't match every valid unicode escape. :-)) – alexis Jan 09 '19 at 16:03
  • Python doesn't complain against `\uNNNNN` so {4} isn't quite correct. – montonero Jan 10 '19 at 07:19
  • Of course it doesn't complain; but the fifth digit is not read as part of the codepoint. You'll end up with `{16-bit codepoint}N`. Read the docs. – alexis Jan 10 '19 at 10:38
  • Anyway I did add handling for SMP characters, as you suggested. (But correctly ;-)) – alexis Jan 10 '19 at 10:40
  • According to this https://stackoverflow.com/questions/27415935/does-unicode-have-a-defined-maximum-number-of-code-points `The maximum valid code point in Unicode is U+10FFFF` so we could go even further and use \U which support 8 hex chars – montonero Jan 10 '19 at 11:01
  • Technically yes, but the entire range above `U+F0000` is in the private use area. So my regex is correct for any real Unicode text. (See ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt). And you have to pad, because `\U` must be followed by exactly 8 digits. – alexis Jan 10 '19 at 12:23