How to use a regular expression to transform <U+5C16> into \u5c16?

Question

I am doing a sentiment analysis project, and first I need to clean the text data. Some of the text contains Chinese and Tagalog, which I am trying to translate into English. However, every Chinese character in this data file appears as a Unicode code point marker such as:

<U+5C16>

which Python's usual encode/decode methods cannot handle directly. So I want to transform this kind of pattern into:

\u5c16

Then I think we could use the following code to get the Chinese characters I want:

text.encode('latin-1').decode('unicode_escape')

So the question now is: how do I use a regex to transform <U+5C16> into \u5c16?

Thank you very much!

Update: I think the hardest part is that the 5c16 in \u5c16 must be the lowercase form of the 5C16 in <U+5C16>. In my social media dataset, most of the text looks like this:

<U+5C16><U+6C99><U+5480><U+9418><U+6A13>

If I could transform the above text to '\u5c16\u6c99\u5480\u9418\u6a13' and print it in Python, I could get what I really want:

尖沙咀鐘樓

But how could I do this? Any insights and hints would be appreciated!

Not quite get what do you mean by "Note the c in \u5c16 could also be u". 5c16 is a hexadecimal number and there can't be u in it. — montonero, Jan 09 '19 at 09:15
Are you _sure_ your file is Ascii with codes like this, or could this be how your editor or pager is showing real unicode characters? — alexis, Jan 09 '19 at 13:08

montonero · Accepted Answer · 2019-01-09T12:22:01.800

1

The required regex is something like this:

find: r'<U\+([A-Fa-f0-9]+?)>'

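replace with: r'\\u\1' (the backslash is doubled because Python 3.7+ rejects an unknown escape such as \u in a replacement string)

To turn the resulting string into Unicode characters, call s.encode().decode('unicode-escape').

Example:

re.sub(r'<U\+([A-Fa-f0-9]+?)>',r'\\u\1',s).encode().decode('unicode-escape')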
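A self-contained sketch of the same approach, for anyone who wants to run it as-is. It makes a few assumptions that are not part of the answer above: the function name decode_u_plus is hypothetical, the pattern is narrowed to exactly four hex digits because \u reads exactly four, and the backslash in the replacement is doubled since Python 3.7 and later reject an unknown escape such as \u in a re.sub replacement string.

```python
import re

def decode_u_plus(s):
    """Turn <U+XXXX> markers (4 hex digits) into the characters they name."""
    # r'\\u\1' emits a literal backslash, 'u', then the captured hex digits.
    escaped = re.sub(r'<U\+([A-Fa-f0-9]{4})>', r'\\u\1', s)
    # Assumes the rest of the text is pure ASCII; unicode-escape would
    # misread the UTF-8 bytes of any real non-ASCII character.
    return escaped.encode('ascii').decode('unicode-escape')

print(decode_u_plus('<U+5C16><U+6C99><U+5480><U+9418><U+6A13>'))  # 尖沙咀鐘樓
```

Encoding with 'ascii' rather than the default UTF-8 is a deliberate choice here: it fails loudly if the input already contains real non-ASCII characters, instead of silently corrupting them.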

edited Jan 09 '19 at 12:22

answered Jan 09 '19 at 09:17

montonero


Thank you for your reply! I tried your answer but it does not seem to work – Bright Chang Jan 09 '19 at 10:14
Currently, I can transform '<U+5C16><U+6C99><U+5480><U+9418><U+6A13>' into '\\u5c16\\u6c99\\u5480\\u9418\\u6a13' by using the following code: ```Python string = '' result1 = re.sub('\', '', result1) result2 ``` But how do I get \u5c16\u6c99\u5480\u9418\u6a13? – Bright Chang Jan 09 '19 at 11:48
I've updated my answer with an example how to turn replaced string to unicode characters. – montonero Jan 09 '19 at 12:23
The quantifier `+?` will always match exactly once :-) – alexis Jan 09 '19 at 13:10
@alexis It would. If it wasn't followed by any other character. – montonero Jan 09 '19 at 14:00
oh ok, I see. You were right, I missed that the `>` is in the regex. (But what's the point of the `?` anyway? There's only one way to match.) – alexis Jan 09 '19 at 21:23
@alexis Agree, ? is an overkill and is more a safe measure than a necessity in this case. But it does not hurt. – montonero Jan 10 '19 at 07:19

alexis · Answer 2 · 2019-01-27T09:31:30.690

1

If your file is exactly as you describe, here's how to convert it:

import re

text = "text with <U+5C16> and so on"
go = re.sub(r"<U\+([0-9a-fA-F]{4})>", r"\\u\1", text)    # BMP: 4 hex digits
go = re.sub(r"<U\+([0-9a-fA-F]{5})>", r"\\U000\1", go)   # SMP: 5 -> 8 hex digits
print(go.encode("ascii").decode('unicode_escape'))

(The line marked "SMP" is only needed if you have characters outside the "basic multilingual plane".)

Output: text with 尖 and so on
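As an alternative sketch, not the method used in this answer, the escape-and-decode round trip can be skipped by converting each captured hex value directly with chr(). The names U_PLUS and replace_u_plus are hypothetical, and accepting 4 to 6 hex digits is an assumption about how the markers were written. This variant also leaves any real non-ASCII characters already in the text untouched.

```python
import re

# Matches <U+XXXX> with 4 to 6 hex digits (the Unicode range ends at U+10FFFF).
U_PLUS = re.compile(r'<U\+([0-9A-Fa-f]{4,6})>')

def replace_u_plus(text):
    """Replace each <U+XXXX> marker with the character it names."""
    def to_char(match):
        code = int(match.group(1), 16)
        # Leave out-of-range markers untouched instead of raising ValueError.
        return chr(code) if code <= 0x10FFFF else match.group(0)
    return U_PLUS.sub(to_char, text)

print(replace_u_plus("text with <U+5C16> and so on"))   # text with 尖 and so on
print(replace_u_plus("café <U+1F600>"))                  # café 😀
```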

edited Jan 27 '19 at 09:31

answered Jan 09 '19 at 13:20

alexis


Unfortunately {4} is not a valid quantifier since Unicode character codes aren't bound to 16 bits. – montonero Jan 09 '19 at 14:04
If they are wider, you can't trivially convert them to `\uNNNN` form. If the OP wants to support the SMP as well, a second regex can convert 5-digit sequences to `r"\U000\1"`, (Anyway, `{4}` is a perfectly _valid_ quantifier; it just won't match every valid unicode escape. :-)) – alexis Jan 09 '19 at 16:03
Python doesn't complain against `\uNNNNN` so {4} isn't quite correct. – montonero Jan 10 '19 at 07:19
Of course it doesn't complain; but the fifth digit is not read as part of the codepoint. You'll end up with `{16-bit codepoint}N`. Read the docs. – alexis Jan 10 '19 at 10:38
Anyway I did add handling for SMP characters, as you suggested. (But correctly ;-)) – alexis Jan 10 '19 at 10:40
According to this https://stackoverflow.com/questions/27415935/does-unicode-have-a-defined-maximum-number-of-code-points `The maximum valid code point in Unicode is U+10FFFF` so we could go even further and use \U which support 8 hex chars – montonero Jan 10 '19 at 11:01
Technically yes, but the entire range above `U+F0000` is in the private use area. So my regex is correct for any real Unicode text. (See ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt). And you have to pad, because `\U` must be followed by exactly 8 digits. – alexis Jan 10 '19 at 12:23
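The point debated in these comments can be checked with a short experiment. This is only an illustrative sketch; the variable names five and padded are made up for the example.

```python
# \u consumes exactly four hex digits; a fifth digit is left as plain text.
five = b'\\u1F600'.decode('unicode_escape')
print(len(five), hex(ord(five[0])), five[1])   # 2 0x1f60 0

# \U requires exactly eight hex digits, hence the zero padding.
padded = b'\\U0001F600'.decode('unicode_escape')
print(len(padded), hex(ord(padded)))           # 1 0x1f600
```

So a five-digit code point written after \u silently becomes a different character followed by a stray digit, which is why the five-digit case needs the padded \U form.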
