How to create a regex pattern for an arbitrary range of surrogate pairs

Question

In "narrow" Python builds, a range of characters outside the BMP has to be matched with a special regex pattern built from surrogate pairs. Such a pattern can get rather complex:

# Pattern we want:
pattern = '[\U000105c0-\U0001cb40]'

# Pattern we should use in "narrow" build:
pattern = '(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])'

But how can I create such a pattern for an arbitrary range of surrogate pairs (for example, \U000105c0-\U0001cb40)?

What algorithm produces this pattern?

Is there a ready-to-use solution in Python?

Do you mean you want to replace such patterns (even in longer patterns) dynamically? — Wiktor Stribiżew, Mar 25 '17 at 08:28
@WiktorStribiżew let's say generate for two given chars: `get_pattern('\U000105c0', '\U0001cb40')` — Mikhail Gerasimov, Mar 25 '17 at 08:32
I know of a [JS ES6 to ES5 transpiler](https://github.com/mathiasbynens/regexpu). — Wiktor Stribiżew, Mar 25 '17 at 08:51

score 0 · Answer 1 · 2017-03-25T18:32:58.747

Install the app from http://www.regexformat.com (Windows only).

You can do the following with any range.
All you need is a regex (or anything else) that describes it.

Open the UCD Interface: https://i.stack.imgur.com/bNTGV.jpg

On the Custom-Rx page, enter [\x{105c0}-\x{1cb40}]

Select the conversion syntax you want in the output
(this example used the \x{} syntax).

Click the Get Hex Conversion -> UTF-16 button (it's a menu button).

Copy the regex at the bottom of the Result box:

 (?:
      \x{D801} [\x{DDC0}-\x{DFFF}] 
   |  [\x{D802}-\x{D831}] [\x{DC00}-\x{DFFF}] 
   |  \x{D832} [\x{DC00}-\x{DF40}] 
 )

If you paste it into one of the main app documents and
hit Compress, it becomes:

(?:\x{D801}[\x{DDC0}-\x{DFFF}]|[\x{D802}-\x{D831}][\x{DC00}-\x{DFFF}]|\x{D832}[\x{DC00}-\x{DF40}])

Here it is using the \uXXXX syntax:
(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])
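To double-check the endpoints of the converted regex, here is a minimal Python sketch of the standard UTF-16 split of a code point into a high and a low surrogate. The helper name `to_surrogates` is made up for this illustration and has nothing to do with the app.

```python
def to_surrogates(cp):
    # Split a supplementary code point (U+10000..U+10FFFF) into its
    # UTF-16 (high, low) surrogate code units.
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('not a supplementary code point: %#x' % cp)
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


for cp in (0x105C0, 0x1CB40):
    high, low = to_surrogates(cp)
    print('U+%05X -> %04X %04X' % (cp, high, low))
# U+105C0 -> D801 DDC0
# U+1CB40 -> D832 DF40
```

The two results are exactly the first and last code-unit pairs that appear in the regex above.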

score 0 · Accepted Answer · answered Mar 26 '17 at 20:39

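I created a function that handles most of the cases we may need. It supports ranges where both endpoints are single BMP characters or both are surrogate pairs; a range that starts in the BMP and ends outside it raises ValueError.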

Python 2 code:

from __future__ import absolute_import, division, print_function, unicode_literals
__metaclass__ = type

import struct


def unichar(i):
    """
    unichr for "narrow" builds.
    """
    try:
        return unichr(i)
    except ValueError:
        # Explicit little-endian byte order, so the result does not depend on the platform.
        return struct.pack('<I', i).decode('utf-32-le')


def get_pattern(char_from, char_to):
    """
    Returns regex pattern for unicode chars that handles surrogates in "narrow" builds.
    """
    if all(len(c) == 1 for c in (char_from, char_to)):
        if char_from == char_to:
            return char_from
        else:
            return '[{}-{}]'.format(char_from, char_to)
    elif all(len(c) == 2 for c in (char_from, char_to)):
        f1, f2 = [ord(i) for i in char_from]
        t1, t2 = [ord(i) for i in char_to]
        if t1 - f1 == 0:
            p1 = '{}[{}-{}]'.format(unichar(f1), unichar(f2), unichar(t2))
            return '(?:' + p1 + ')'
        elif t1 - f1 == 1:
            p1 = '{}[{}-\uDFFF]'.format(unichar(f1), unichar(f2))
            p3 = '{}[\uDC00-{}]'.format(unichar(t1), unichar(t2))
            return '(?:' + '|'.join([p1, p3]) + ')'
        else:
            p1 = '{}[{}-\uDFFF]'.format(unichar(f1), unichar(f2))
            p2 = '[{}-{}][\uDC00-\uDFFF]'.format(unichar(f1+1), unichar(t1-1))
            p3 = '{}[\uDC00-{}]'.format(unichar(t1), unichar(t2))
            return '(?:' + '|'.join([p1, p2, p3]) + ')'
    else:
        raise ValueError('Range is not supported by this function {}-{}'.format(char_from, char_to))


# Example:
if __name__ == '__main__':
    print(repr(get_pattern('\U000105c0', '\U0001cb40')))

    # (?:\ud801[\uddc0-\udfff]|[\ud802-\ud831][\udc00-\udfff]|\ud832[\udc00-\udf40])
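For ranges the function above rejects (a BMP start with an end outside the BMP), the same idea can be extended. What follows is only a sketch under a few assumptions: it takes integer code points instead of characters, it returns the pattern as escaped text rather than a ready-to-compile unicode string, and the name `utf16_range_pattern` and its helpers are hypothetical. It also does not treat a BMP part that itself covers U+D800-U+DFFF specially; the pair alternatives are simply placed first.

```python
LOW_MIN, LOW_MAX = 0xDC00, 0xDFFF


def _esc(unit):
    # Render one UTF-16 code unit as backslash-u escape text.
    return '\\u%04X' % unit


def _cls(lo, hi):
    # A single unit needs no brackets; otherwise emit a [lo-hi] class.
    return _esc(lo) if lo == hi else '[%s-%s]' % (_esc(lo), _esc(hi))


def _split(cp):
    # Standard UTF-16 split of a supplementary code point.
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), LOW_MIN + (offset & 0x3FF)


def utf16_range_pattern(cp_from, cp_to):
    if not 0 <= cp_from <= cp_to <= 0x10FFFF:
        raise ValueError('invalid code point range')
    bmp = None
    if cp_from <= 0xFFFF:
        bmp = _cls(cp_from, min(cp_to, 0xFFFF))
        if cp_to <= 0xFFFF:
            return bmp              # pure BMP range: a plain class is enough
        cp_from = 0x10000           # the rest of the range starts here
    h1, l1 = _split(cp_from)
    h2, l2 = _split(cp_to)
    parts = []
    if h1 == h2:
        parts.append(_esc(h1) + _cls(l1, l2))
    else:
        tail = None
        if l1 != LOW_MIN:           # partial block under the first high surrogate
            parts.append(_esc(h1) + _cls(l1, LOW_MAX))
            h1 += 1
        if l2 != LOW_MAX:           # partial block under the last high surrogate
            tail = _esc(h2) + _cls(LOW_MIN, l2)
            h2 -= 1
        if h1 <= h2:                # high surrogates whose whole low range is covered
            parts.append(_cls(h1, h2) + _cls(LOW_MIN, LOW_MAX))
        if tail:
            parts.append(tail)
    if bmp:
        parts.append(bmp)           # BMP alternative goes last
    return '(?:' + '|'.join(parts) + ')'


print(utf16_range_pattern(0x105C0, 0x1CB40))
# (?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])
```

Unlike `get_pattern`, this sketch drops the first or last alternative when the range starts or ends exactly on a full block of low surrogates, so the output can be shorter for such ranges.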

Ok, your next convert assignment is to do something useful with what you've learned. Given a more meaningful utf-32 class `[\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F\U00020000-\U0002A6DF\U0002A700-\U0002CEAF\U0002F800-\U0002FA1F]` convert it into utf-16 `(?:[\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F]|(?:[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDEDF\uDF00-\uDFFF]|[\uD86A-\uD872][\uDC00-\uDFFF]|\uD873[\uDC00-\uDEAF]|\uD87E[\uDC00-\uDE1F]))` — , Mar 28 '17 at 04:39
