In "narrow" Python builds, matching a range of characters outside the BMP requires a special regex pattern built from surrogate pairs. Such a pattern can be rather complex:

# Pattern we want:
pattern = '[\U000105c0-\U0001cb40]'

# Pattern we should use in "narrow" build:
pattern = '(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])'

But how can I create such a pattern for an arbitrary range of surrogate pairs (for example \U000105c0-\U0001cb40)?

What would the algorithm for building this pattern be?

Is there a ready-to-use solution in Python?

Mikhail Gerasimov

2 Answers

0

Install the app from http://www.regexformat.com (Windows only).

The steps below work for any range.
You just need a regex (or anything else) that describes it.

Open the UCD Interface https://i.stack.imgur.com/bNTGV.jpg

On the Custom-Rx page, enter [\x{105c0}-\x{1cb40}]

Select the conversion syntax you want in the output
(this example uses the \x{} syntax).

Click the button Get Hex Conversion -> UTF-16 (it's a menu button)

Copy the regex at the bottom of the Result box.

 (?:
      \x{D801} [\x{DDC0}-\x{DFFF}] 
   |  [\x{D802}-\x{D831}] [\x{DC00}-\x{DFFF}] 
   |  \x{D832} [\x{DC00}-\x{DF40}] 
 )

If you paste it into one of the main app documents and
hit compress, it becomes

(?:\x{D801}[\x{DDC0}-\x{DFFF}]|[\x{D802}-\x{D831}][\x{DC00}-\x{DFFF}]|\x{D832}[\x{DC00}-\x{DF40}])

Here it is using the \uXXXX syntax:
(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])
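
The converted pattern can be sanity-checked without a narrow build. The following is only a minimal sketch for Python 3, which has no narrow builds, so it emulates one by splitting each character into its UTF-16 code units. The names `to_utf16_units` and `in_range` are hypothetical helpers introduced here for illustration.

```python
import re

# The UTF-16 pattern from above. Python turns each escape into a lone
# surrogate code point, which the re module treats like any other character.
PATTERN = re.compile(
    '(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])'
)


def to_utf16_units(cp):
    """Hypothetical helper: return the code point as a str of UTF-16 code
    units, which is how a "narrow" build would store it."""
    data = chr(cp).encode('utf-16-be')
    return ''.join(
        chr(int.from_bytes(data[i:i + 2], 'big')) for i in range(0, len(data), 2)
    )


def in_range(cp):
    """True if the code-unit form of cp is matched in full by PATTERN."""
    return PATTERN.fullmatch(to_utf16_units(cp)) is not None


print(in_range(0x105C0), in_range(0x1CB40))  # both endpoints: True True
print(in_range(0x105BF), in_range(0x1CB41))  # just outside:   False False
```

Checking the two endpoints and their immediate neighbours is usually enough to catch an off-by-one mistake in the first or last alternative.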

0

I wrote a function that handles most of the cases we may need: both endpoints inside the BMP, or both endpoints outside it.

Python 2 code:

from __future__ import absolute_import, division, print_function, unicode_literals
__metaclass__ = type

import struct


def unichar(i):
    """
    unichr for "narrow" builds.
    """
    try:
        return unichr(i)
    except ValueError:
        return struct.pack('i', i).decode('utf-32')


def get_pattern(char_from, char_to):
    """
    Returns regex pattern for unicode chars that handles surrogates in "narrow" builds.
    """
    if all(len(c) == 1 for c in (char_from, char_to)):
        if char_from == char_to:
            return char_from
        else:
            return '[{}-{}]'.format(char_from, char_to)
    elif all(len(c) == 2 for c in (char_from, char_to)):
        f1, f2 = [ord(i) for i in char_from]
        t1, t2 = [ord(i) for i in char_to]
        if t1 - f1 == 0:
            p1 = '{}[{}-{}]'.format(unichar(f1), unichar(f2), unichar(t2))
            return '(?:' + p1 + ')'
        elif t1 - f1 == 1:
            p1 = '{}[{}-\uDFFF]'.format(unichar(f1), unichar(f2))
            p3 = '{}[\uDC00-{}]'.format(unichar(t1), unichar(t2))
            return '(?:' + '|'.join([p1, p3]) + ')'
        else:
            p1 = '{}[{}-\uDFFF]'.format(unichar(f1), unichar(f2))
            p2 = '[{}-{}][\uDC00-\uDFFF]'.format(unichar(f1+1), unichar(t1-1))
            p3 = '{}[\uDC00-{}]'.format(unichar(t1), unichar(t2))
            return '(?:' + '|'.join([p1, p2, p3]) + ')'
    else:
        raise ValueError('Range is not supported by this function {}-{}'.format(char_from, char_to))


# Example:
if __name__ == '__main__':
    print(repr(get_pattern('\U000105c0', '\U0001cb40')))

    # (?:\ud801[\uddc0-\udfff]|[\ud802-\ud831][\udc00-\udfff]|\ud832[\udc00-\udf40])
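
The function above rejects ranges that start in the BMP and end outside it. The same idea can be written with plain code point arithmetic, which covers that case too. What follows is a minimal Python 3 sketch under stated assumptions, not a tested replacement for the function above. The names `utf16_range_pattern`, `_esc`, `_cls` and `_split` are hypothetical. It takes integer code points and returns the pattern as ASCII text with \uXXXX escapes, so the result can be pasted into a unicode literal or passed to Python 3's `re`, which understands \uXXXX escapes in str patterns. It assumes the range is not meant to exclude the lone surrogates U+D800-U+DFFF when it spans them.

```python
def _esc(cp):
    """Render one UTF-16 code unit as a \\uXXXX escape."""
    return '\\u{:04X}'.format(cp)


def _cls(lo, hi):
    """Render a single unit or a [lo-hi] character class."""
    return _esc(lo) if lo == hi else '[{}-{}]'.format(_esc(lo), _esc(hi))


def _split(cp):
    """Return the (high, low) surrogates of a code point above U+FFFF."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


def utf16_range_pattern(first, last):
    """Hypothetical helper: regex text matching first..last as UTF-16 units."""
    if not 0 <= first <= last <= 0x10FFFF:
        raise ValueError('invalid code point range')
    parts = []
    if first <= 0xFFFF:  # BMP portion: one code unit per character
        parts.append(_cls(first, min(last, 0xFFFF)))
        first = 0x10000
    if first <= last:  # astral portion: surrogate pairs
        h1, l1 = _split(first)
        h2, l2 = _split(last)
        if h1 == h2:
            parts.append(_esc(h1) + _cls(l1, l2))
        else:
            tail = None
            if l1 != 0xDC00:  # partial block under the first high surrogate
                parts.append(_esc(h1) + _cls(l1, 0xDFFF))
                h1 += 1
            if l2 != 0xDFFF:  # partial block under the last high surrogate
                tail = _esc(h2) + _cls(0xDC00, l2)
                h2 -= 1
            if h1 <= h2:  # full blocks in between
                parts.append(_cls(h1, h2) + '[\\uDC00-\\uDFFF]')
            if tail:
                parts.append(tail)
    return '(?:' + '|'.join(parts) + ')'


print(utf16_range_pattern(0x105C0, 0x1CB40))
# (?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])
```

A class with several ranges can be handled by calling this once per range and joining the results with `|`.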
Mikhail Gerasimov
  • Ok, your next convert assignment is to do something useful with what you've learned. Given a more meaningful utf-32 class `[\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F\U00020000-\U0002A6DF\U0002A700-\U0002CEAF\U0002F800-\U0002FA1F]` convert it into utf-16 `(?:[\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F]|(?:[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDEDF\uDF00-\uDFFF]|[\uD86A-\uD872][\uDC00-\uDFFF]|\uD873[\uDC00-\uDEAF]|\uD87E[\uDC00-\uDE1F]))` –  Mar 28 '17 at 04:39