Unicode: Automatically detect high characters and convert them to surrogate pairs

Asked Jul 31 '14 at 15:13

Active Jul 31 '14 at 15:13

Viewed 145 times

I want to write a program in python that writes a string to a file in a manner that a java application can read it. This java application reads the string in modified UTF-8 format (see this question). It basically means that all unicode characters have to be written as surrogate characters aka UTF-16BE. Is there a graceful, effective method that can do this.

I have tried the following to do so:

import re;
o_string = b''
for splitter in re.split("([\U00010000-\U0010FFFF])", w_string):
    if len(splitter) == 1 and ord(splitter) > 0xFFFF:
        o_string += splitter.encode("UTF-16")
    else:
        o_string += splitter.decode("UTF-8")
print(o_string)

But this seems so bulky (also it uses regex which seems to be en easy to dodge overhead). Can someone propose a better solution to this?

edited May 23 '17 at 11:52

Community

asked Jul 31 '14 at 15:13

WorldSEnder

4,875
2
28
64

2

Does [this question](http://stackoverflow.com/questions/1393004/java-modified-utf-8-strings-in-python) help? (ie, this looks like a duplicate) – Norman Gray Jul 31 '14 at 16:02

Unicode: Automatically detect high characters and convert them to surrogate pairs

0 Answers0