Build a token for Simplified Chinese Identifiers

Question

I'm trying to build a token for Simplified Chinese Identifiers.

Simplified Chinese Identifiers are defined in the specification as follows:

simplified-Chinese-identifier = first-sChinese-identifier-character *subsequent-sChinese-identifier-character
first-sChinese-identifier-character = (first-Latin-identifier-character / CP936-initialcharacter)
subsequent-sChinese-identifier-character = (subsequent-Latin-identifier-character / CP936-
subsequent-character)
CP936-initial-character = < character ranges specified in section 3.3.5.1.3>
CP936-subsequent-character = < character ranges specified in section 3.3.5.1.3>

Here is UNICODE-BESTFIT and Windows Codepage 936.

What I did is, for instance, to look for %xA3C1 in the page, then take its corresponding code, which is 0xff21. Thus, I found the corresponding code for %xA3C1,%xA3DA, %xA3E1, %xA3FA, %xA1A2, %xA1AA, %xA1AC, %xA1AD, %xA1B2, %xA1E6; %xA1E8, %xA1EF, %xA2B1, %xA2FC, %xA4A1, %xFE4F, and build CP936-initial-character as follows:

let cP936_initial_character = [%sedlex.regexp? 0xff21 .. 0xff3a | 0xff41 .. 0xff5a | 0x3001 .. 0x2014 | 0x2016 .. 0x2026 | 0x3014 .. 0x2103 | 0x00a4 .. 0x2605 | 0x2488 .. 0x216b | 0x3041 .. 0xfa29]

However, the problem is that some ranges look odd, for example, 0x00a4 .. 0x2605 and 0x2488 .. 0x216b which are not in a good order; and 0x3041 .. 0xfa29 looks too large.

Does anyone know what's the correct way to build this token?

A side note: there must be a mistake in [](https://docs.microsoft.com/en-us/openspecs/microsoft_general_purpose_programming_languages/ms-vbal/ab0dc447-42df-4390-84ef-92147726768c) enumeration. For instance, in file `bestfit936.txt` 24984th line `0x252c 0xa9d0 ;┬` is _Box Drawings Light Down And Horizontal_ and the `0xa9d0` is in range `%xA4A1-FE4F`… BTW, follow the `WindowsBestFit/readme.txt` rules: _Each file is encoded in the code page it describes_, eg: `bestfit936.txt` is encoded in the `gb2312` encoding. Then you can see `0xff21 0xa3c1 ;Ａ` (instead of `;��`). — JosefZ, Aug 20 '21 at 09:29

JosefZ · Accepted Answer · 2021-08-22T14:45:03.143

Follow WindowsBestFit/readme.txt; in particular, description of multibyte mapping records in the WCTABLE section (the WCTABLE tag marks the start of the Unicode UTF-16 (WideChar) to "MultiByte" bytes…).

The following, partially commented, Python 3 script (sorry, I don't speak VBA):

reads (previously downloaded) bestfit936.txt file line by line,
parses its WCTABLE section and builds an array of Unicode codepoints which gb2312 codepoint (codepage 936) matches rules of Simplified Chinese Identifiers and which Unicode Category is letter ('Ll','Lu','Lo') (see variable init_chars_16),
sorts codepoints in variable init_chars_16 and creates corresponding array of characters (variable init_chars_utf16),
groups codepoints in variable init_chars_16 to longest consecutive chains (variable init_chars_groups), and
prints a few characters, their codepoints and corresponding consecutive ranges (supply argument 1 to print all characters). There is 15477 applicable codepoints in (unfortunately) 1977 consecutive ranges.

This is done only for CP936-initial-character however the same could be applied for CP936-subsequent-character as well (supply argument 2, see also Usage below and output examples).

from itertools import groupby
from operator  import itemgetter
import unicodedata
import sys

if ( len(sys.argv) > 1 and [1,2,'1','2'].__contains__( sys.argv[1]) ):
  init_chars_test = int(sys.argv[1])
  demo_chars_test = False
else:
  init_chars_test = 1
  demo_chars_test = True

def first_last( some_list ):
  if len( some_list ) > 1:
    return '..'.join( [ '0x{:04x}'.format(some_list[0]),
                        '0x{:04x}'.format(some_list[-1]) ] )
  else:
    return '0x{:04x}'.format(some_list[0])

if init_chars_test == 1:
  unicode_category = ['Ll','Lu','Lo']           # letters
  print( str('DEMO ' if demo_chars_test else '')
          + 'CP936-initial-character:', unicode_category, '\n')
  init_chars_CP936 = sorted(
    list( range( 0xA3C1,0xA3DA +1)) + 
    list( range( 0xA3E1,0xA3FA +1)) +
    list( range( 0xA1A2,0xA1AA +1)) +
    list( range( 0xA1AC,0xA1AD +1)) +
    list( range( 0xA1B2,0xA1E6 +1)) +
    list( range( 0xA1E8,0xA1EF +1)) +
    list( range( 0xA2B1,0xA2FC +1)) +
    list( range( 0xA4A1,0xFE4F +1)) )
else:
  unicode_category=['Ll','Lu','Lo','Nd','Pc'] # letters or numbers or underscore
  print( 'CP936-subsequent-character addendum:', unicode_category, '\n')
  init_chars_CP936 = sorted(
    list( range( 0xA3DF,0xA3DF +1)) +
    list( range( 0xA3B0,0xA3B9 +1)) )

wctable = False
init_chars_16 = []     # Unicode UTF-16 codepoints (as integers) 
init_chars_undef = []  # 
i = 0
with open(r'D:\Downloads\Unicode\bestfit936.txt',
  mode='r', encoding='gb2312', errors='backslashreplace') as ff:
  for rawline in ff:
    i+=1
    line = rawline.split('\t')
    if line[0].upper() in [
      'CODEPAGE','CPINFO','MBTABLE','DBCSRANGE','DBSCTABLE','ENDCODEPAGE']:
        wctable = False
    if  wctable:
      if len(line) >1:
        code_936 = int(line[1],16) if line[1].lower().startswith('0x') else 0
        if code_936 in init_chars_CP936:
          code16be = int(line[0],16) if line[0].lower().startswith('0x') else 0
          # µ vs. μ error 
          # 0x00b5  0xa6cc  ;μ     # at line 24608: 'Micro Sign'
          # 0x03bc  0xa6cc  ;μ     # at line 24718: 'Greek Small Letter Mu'
          if ( code16be > 0x00ff   # exclude 0x00b5 and permit only letters
             and ( unicodedata.category( chr(code16be)) in unicode_category )
             # and  len( chr(code16be).encode('gb2312','ignore'))> 0
             ):
              # if len(unicodedata.normalize('NFKD',chr(code16be))) == 1:
              init_chars_16.append( code16be)
          else:
              init_chars_undef.append(line[1])
    else:
      # The WCTABLE tag marks the start of the Unicode UTF-16 (WideChar) to "MultiByte" bytes
      wctable = rawline.startswith( 'WCTABLE')
      # for debugging purposes if wctable: print(i, rawline)

init_chars_16 = sorted( set( init_chars_16 ) )
init_chars_utf16 = [ chr(x) for x in init_chars_16 ]

init_chars_groups = [] # groups of consecutive code points
#                      # https://stackoverflow.com/questions/2154249/
for k, g in groupby( enumerate(init_chars_16), lambda ix : ix[0] - ix[1]):
  init_chars_groups.append( first_last(list(map(itemgetter(1), g))))

def finalprint(od, do, odg, dog, sep):
  global init_chars_utf16, init_chars_16, init_chars_groups
  print( ''.join( init_chars_utf16[od:do] )+'\n') # characters
  print( ', '.join( '0x{:04x}'.format(x)      
      for x in init_chars_16[od:do] )+'\n')       # their Unicode codepoints
  print( sep.join( x 
      for x in init_chars_groups[odg:dog])+'\n' ) # groups of Unicode codepoints
  print( len(init_chars_groups), ':', odg, dog )  # total number of groups and displayed range

if init_chars_test == 1 and demo_chars_test:
  finalprint(354, 380, 23, 30, ' ')
else:
  finalprint(0, None, 0, None, '|')

print( init_chars_test, len(init_chars_CP936), len(init_chars_undef),
       len(init_chars_utf16), len(init_chars_groups) ) # resume

if demo_chars_test:
    print( '\nUsage:\n\t%s [ 1 | 2 ]\n' % sys.argv[0])
    print( 'Examples:\n\t%s        // prints CP936-initial-character DEMO' % sys.argv[0])
    print( '\n\t%s 1      // prints CP936-initial-character' % sys.argv[0])
    print( '\n\t%s 2      // prints CP936-subsequent-character addendum' % sys.argv[0])

Output: .\SO\68766804.py

DEMO CP936-initial-character: ['Ll', 'Lu', 'Lo']

一丁七万丈三上下丌不与丐丑专且丕世丘丙业丛东丝丞丢两

0x4e00, 0x4e01, 0x4e03, 0x4e07, 0x4e08, 0x4e09, 0x4e0a, 0x4e0b, 0x4e0c, 0x4e0d, 0x4e0e, 0x4e10, 0x4e11, 0x4e13, 0x4e14, 0x4e15, 0x4e16, 0x4e18, 0x4e19, 0x4e1a, 0x4e1b, 0x4e1c, 0x4e1d, 0x4e1e, 0x4e22, 0x4e24

0x4e00..0x4e01 0x4e03 0x4e07..0x4e0e 0x4e10..0x4e11 0x4e13..0x4e16 0x4e18..0x4e1e 0x4e22

1977 : 23 30
1 23159 2085 15477 1977

Usage:
        D:\bat\SO\68766804.py [ 1 | 2 ]

Examples:
        D:\bat\SO\68766804.py        // prints CP936-initial-character DEMO

        D:\bat\SO\68766804.py 1      // prints CP936-initial-character

        D:\bat\SO\68766804.py 2      // prints CP936-subsequent-character addendum

Output: .\SO\68766804.py 2

CP936-subsequent-character addendum: ['Ll', 'Lu', 'Lo', 'Nd', 'Pc']

０１２３４５６７８９＿

0xff10, 0xff11, 0xff12, 0xff13, 0xff14, 0xff15, 0xff16, 0xff17, 0xff18, 0xff19, 0xff3f

0xff10..0xff19|0xff3f

2 : 0 None
2 11 0 11 2

Thank you for your answer. I will leave a comment here once i have tested it. — SoftTimur, Aug 22 '21 at 08:07

Build a token for Simplified Chinese Identifiers

1 Answers1