New user here, please be gentle.

We want to port a piece of Python code to C++, but it relies on Python's unicodedata module, in particular this function:

unicodedata.category('A')  # 'L'etter, 'u'ppercase
'Lu'

Can this be readily achieved in C++? Would embedding compiled Python code in C++ be worthwhile, given that we want to do this in the context of online TensorFlow model serving? Thanks!

John Jiang

1 Answer

Just stick the output of this Python code into a C++ source file:

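import unicodedata

# One enumerator per Unicode general category, so the table entries compile as C++ identifiers.
print('typedef enum {Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs} CATEGORY_e;')
# One entry per code point (U+0000 to U+10FFFF), indexed directly by code point.
print('const CATEGORY_e CHAR_CATEGORIES[] = {%s};' % ', '.join(unicodedata.category(chr(codepoint)) for codepoint in range(0x110000)))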

(If you are still using Python 2.x instead of 3.x, replace chr with unichr.)

You now have a convenient lookup table of Unicode character categories to use in your C++ programs.
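
Before pasting the generated table, it may be worth checking that the enum really names every category your Python build reports, since an unlisted name would stop the C++ file from compiling. A minimal sketch of such a check follows; ENUM_NAMES, categories_in_use and missing_from_enum are hypothetical names introduced here, not part of unicodedata:

```python
import unicodedata

# Enumerator names copied from the typedef printed by the script above.
ENUM_NAMES = ("Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No "
              "Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs").split()


def categories_in_use():
    """Return every category string unicodedata reports for any code point."""
    return {unicodedata.category(chr(cp)) for cp in range(0x110000)}


def missing_from_enum():
    """Return categories that appear in the data but not in ENUM_NAMES."""
    return categories_in_use() - set(ENUM_NAMES)


if __name__ == "__main__":
    missing = missing_from_enum()
    print("missing from enum:", sorted(missing) or "none")
```

If the check reports anything missing, the enum needs extending before the table will compile.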

dan04
  • Ingenious solution! – John Jiang Mar 15 '19 at 22:14
  • Note that this array will have 1 114 112 elements in it, so if you don't want a *huge* *.cpp file to compile, you'll probably want to compress the data. But I figured it would be best to show you the basic idea, and you can optimize it from there. – dan04 Mar 15 '19 at 22:22
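
One possible way to compress the data, offered as a sketch rather than as what dan04 had in mind, is run-length encoding: neighbouring code points usually share a category, so the table can be stored as a sorted list of (first code point, category) runs and searched with a binary search. The names build_ranges, lookup, emit_cpp, CategoryRange and CATEGORY_RANGES below are hypothetical, and the emitted C++ assumes the CATEGORY_e enum from the answer is already declared:

```python
import bisect
import unicodedata


def build_ranges():
    """Collapse consecutive code points with the same category into runs.

    Returns two parallel lists: the first code point of each run and
    that run's category string.
    """
    starts, cats = [], []
    for cp in range(0x110000):
        cat = unicodedata.category(chr(cp))
        if not cats or cats[-1] != cat:
            starts.append(cp)
            cats.append(cat)
    return starts, cats


def lookup(starts, cats, cp):
    """Find the run containing cp: the last run whose start is <= cp."""
    return cats[bisect.bisect_right(starts, cp) - 1]


def emit_cpp(starts, cats):
    """Render the runs as a C++ array (assumes CATEGORY_e is declared)."""
    rows = ',\n'.join('    {0x%06X, %s}' % (s, c) for s, c in zip(starts, cats))
    return ('struct CategoryRange { unsigned start; CATEGORY_e category; };\n'
            'const CategoryRange CATEGORY_RANGES[] = {\n%s\n};' % rows)


if __name__ == "__main__":
    starts, cats = build_ranges()
    print(emit_cpp(starts, cats))
```

On the C++ side, a lookup would then binary-search CATEGORY_RANGES on start (for example with std::upper_bound) and step back one entry, mirroring lookup above. This trades the O(1) array index for an O(log n) search in exchange for a much smaller source file.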