C++ implementation of Python's unicodedata library

Question

New user here, please be gentle.

We are looking to port a piece of Python code to C++, but it relies on Python's unicodedata module, in particular this function:

unicodedata.category('A')  # 'L'etter, 'u'ppercase
'Lu'

Is there any chance this can be readily achieved in C++? Would embedding compiled Python code in C++ be worthwhile, assuming we want to do this in the context of online TensorFlow model serving? Thanks!

dan04 · Answer 1 · 2019-03-15T22:58:05.120

3

Just stick the output of this Python code into a C++ source file:

import unicodedata

print('typedef enum {Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs} CATEGORY_e;')
print('const CATEGORY_e CHAR_CATEGORIES[] = {%s};' % ', '.join(unicodedata.category(chr(codepoint)) for codepoint in range(0x110000)))

(If you are still using Python 2.x instead of 3.x, replace chr with unichr.)

You now have a convenient lookup table of Unicode character categories to use in your C++ programs.
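On the C++ side, using the table could look roughly like the sketch below. The helper names category and category_name are made up for illustration, and the array here is a hand-written stand-in that only covers U+0000..U+0041 so the snippet compiles by itself; in practice you would paste in the generated 0x110000-entry array instead.

```cpp
#include <cassert>
#include <cstddef>

// Same enumerator order as the typedef printed by the Python script.
typedef enum {Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No,
              Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs} CATEGORY_e;

// Stand-in for the generated table (assumption: only U+0000..U+0041 listed).
const CATEGORY_e CHAR_CATEGORIES[] = {
    Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc,  // U+0000..U+000F
    Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc, Cc,  // U+0010..U+001F
    Zs, Po, Po, Po, Sc, Po, Po, Po, Ps, Pe, Po, Sm, Po, Pd, Po, Po,  // U+0020..U+002F
    Nd, Nd, Nd, Nd, Nd, Nd, Nd, Nd, Nd, Nd, Po, Po, Sm, Sm, Sm, Po,  // U+0030..U+003F
    Po, Lu                                                           // U+0040..U+0041
};
const std::size_t CHAR_CATEGORIES_SIZE =
    sizeof(CHAR_CATEGORIES) / sizeof(CHAR_CATEGORIES[0]);

// Hypothetical helper: category of a code point.
// Anything past the end of the table is reported as Cn (unassigned).
CATEGORY_e category(char32_t cp) {
    return cp < CHAR_CATEGORIES_SIZE ? CHAR_CATEGORIES[cp] : Cn;
}

// Hypothetical helper: the two-letter name that unicodedata.category()
// returns in Python, indexed by enumerator value.
const char* category_name(CATEGORY_e c) {
    static const char* const NAMES[] = {
        "Cn", "Cc", "Cf", "Co", "Cs", "Ll", "Lm", "Lo", "Lt", "Lu",
        "Mc", "Me", "Mn", "Nd", "Nl", "No", "Pc", "Pd", "Pe", "Pf",
        "Pi", "Po", "Ps", "Sc", "Sk", "Sm", "So", "Zl", "Zp", "Zs"};
    assert(static_cast<std::size_t>(c) < sizeof(NAMES) / sizeof(NAMES[0]));
    return NAMES[c];
}
```

With the full table in place, category_name(category(U'A')) should give "Lu", mirroring the Python call in the question.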

edited Mar 15 '19 at 22:58

answered Mar 15 '19 at 21:47


1

Ingenious solution! – John Jiang Mar 15 '19 at 22:14
Note that this array will have 1 114 112 elements in it, so if you don't want a *huge* *.cpp file to compile, you'll probably want to compress the data. But I figured it would be best to show you the basic idea, and you can optimize it from there. – dan04 Mar 15 '19 at 22:22
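One possible way to compress the data (not something spelled out in this thread, just a common approach) is to store runs of consecutive code points that share a category and binary-search them. The sketch below assumes a hypothetical CategoryRun struct and a hand-written stand-in table covering only U+0000..U+009F; a real table would be generated by a Python script along the same lines as the one in the answer, and its end marker would be 0x110000.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>

typedef enum {Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No,
              Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs} CATEGORY_e;

// Hypothetical run record: every code point from `first` up to (but not
// including) the next run's `first` has category `cat`.
struct CategoryRun {
    char32_t first;
    CATEGORY_e cat;
};

// Stand-in data (assumption): runs for U+0000..U+009F only, sorted by `first`.
const CategoryRun CATEGORY_RUNS[] = {
    {0x00, Cc}, {0x20, Zs}, {0x21, Po}, {0x24, Sc}, {0x25, Po}, {0x28, Ps},
    {0x29, Pe}, {0x2A, Po}, {0x2B, Sm}, {0x2C, Po}, {0x2D, Pd}, {0x2E, Po},
    {0x30, Nd}, {0x3A, Po}, {0x3C, Sm}, {0x3F, Po}, {0x41, Lu}, {0x5B, Ps},
    {0x5C, Po}, {0x5D, Pe}, {0x5E, Sk}, {0x5F, Pc}, {0x60, Sk}, {0x61, Ll},
    {0x7B, Ps}, {0x7C, Sm}, {0x7D, Pe}, {0x7E, Sm}, {0x7F, Cc},
};

// One past the last code point covered by the stand-in table.
// With the full generated data this would be 0x110000.
const char32_t CATEGORY_RUNS_END = 0xA0;

// Look up a code point with a binary search over the run starts.
CATEGORY_e category(char32_t cp) {
    if (cp >= CATEGORY_RUNS_END) {
        return Cn;  // outside the table: report as unassigned
    }
    // Find the first run that starts after cp; the run before it contains cp.
    const CategoryRun* it = std::upper_bound(
        std::begin(CATEGORY_RUNS), std::end(CATEGORY_RUNS), cp,
        [](char32_t value, const CategoryRun& run) { return value < run.first; });
    assert(it != std::begin(CATEGORY_RUNS));  // holds because the first run starts at 0
    return (it - 1)->cat;
}
```

Lookups become O(log n) in the number of runs instead of a single array index, in exchange for a source file that should be much smaller, since neighbouring code points often share a category.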
