The goal is to port this Perl regex (from here) into Python:

$norm_text =~ s/(\P{N})(\p{P})/$1 $2 /g;

First, I copied the \p{P} and \p{N} character lists into readable text files and loaded them:

import re

import requests
from six import text_type

n_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Number.txt'
p_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Punctuation.txt'

NUMS = text_type(requests.get(n_url).content.decode('utf8'))
PUNCTS = text_type(requests.get(p_url).content.decode('utf8'))

But when I tried to compile the regex:

re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))

It throws this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
    p = _parse_sub(source, state, sub_verbose)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ~-- at position 217 (line 1, column 218)

Looking around, the problem seems to be that the dashes are not escaped within the character sets (see "Python regex bad character range").
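A minimal reproduction of what I think is happening, as a sketch only: the `try_compile` helper and the three-character class body are made up for illustration and are not taken from the real NUMS data. Inside `[...]`, `~--` is read as a range from `~` to `-`, and since `~` has the higher code point the range is rejected.

```python
import re
import warnings


def try_compile(char_class_body):
    """Compile u'[<body>]' and return the error message, or None on success.

    Hypothetical helper, only here to illustrate the failure mode.
    """
    with warnings.catch_warnings():
        # Newer Python versions may also warn about "--" inside a set;
        # that warning is silenced so only the compile error is observed.
        warnings.simplefilter('ignore')
        try:
            re.compile(u'[{}]'.format(char_class_body))
            return None
        except re.error as exc:
            return str(exc)


# Unescaped: '~', '-', '-' is parsed as the (invalid) range from '~' to '-'.
bad = try_compile(u'~--')

# Escaped: each dash is a literal, so the class is just '~' and '-'.
ok = try_compile(u'~\\-\\-')

print(bad)
print(ok)
```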

There appears to be a run of dash-like symbols in NUMS:

>>> NUMS[215:352]
'~----------------------------------------------------------------------------------------------------------------------------------------'

Then I moved the dash characters to the front of the string, but there are more bad characters:

>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> re.compile(u'([{n}])'.format(n=NUMS2))

[out]:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
    p = _parse_sub(source, state, sub_verbose)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ¬-- at position 502 (line 1, column 503)

So I moved more characters to the front:

>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> NUMS3 = re.escape(NUMS2[500:504]) + NUMS2[:500] + NUMS2[504:]
>>> re.compile(u'([{n}])'.format(n=NUMS3))

This seems to be an endless cycle of detecting what is a "bad character range" in a regex.

Is there a way to automatically identify all "bad characters" in a regex and shift them to the front?

alvas
  • All you need is to escape the `^`, `-`, `]` and ``\`` chars. Try `re.sub(r'[]^\\-]', r'\\\g<0>', NUMS)` and `re.sub(r'[]^\\-]', r'\\\g<0>', PUNCTS)` and then pass them to the `.format()` method. – Wiktor Stribiżew Aug 14 '17 at 09:07
  • It seems like the original NUMS has duplicates too =) @WiktorStribiżew the regex you provided worked. – alvas Aug 14 '17 at 09:22
  • First, `NUMS = ''.join(set(text_type(requests.get(n_url).content.decode('utf8'))))`, then `NUMS = re.sub(r'[]^\\-]', r'\\\g<0>', NUMS)`. And this works: `re.compile(u'([{n}])'.format(n=NUMS))` – alvas Aug 14 '17 at 09:22
  • Yes, the lists you have seem to contain dupes, but that is something you can handle easily, e.g. with `set()`. – Wiktor Stribiżew Aug 14 '17 at 09:26

1 Answer

The main point here is that you need to escape the ^, -, ] and \ chars inside a character class.

Use

NUMS = re.sub(r'[]^\\-]', r'\\\g<0>', NUMS)
PUNCTS = re.sub(r'[]^\\-]', r'\\\g<0>', PUNCTS)
rx = re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))

The r'[]^\\-]' pattern matches a single character (], ^, \ or -), and the r'\\\g<0>' replacement prefixes each match with a backslash, so every such character becomes a literal inside the character class.
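To see the escaping end to end, here is a minimal sketch under a few assumptions: `escape_for_char_class` and `normalize` are hypothetical helper names, the short sample strings stand in for the downloaded NUMS and PUNCTS, and the first class is negated with `^` on the assumption that it should mirror Perl's `\P{N}` (the snippet above uses a non-negated class). Deduplicating with `set()` follows the suggestion from the comments.

```python
import re


def escape_for_char_class(chars):
    """Deduplicate chars and backslash-escape ], ^, \\ and - for use in [...].

    Hypothetical helper; sorting only keeps the resulting pattern stable.
    """
    unique = u''.join(sorted(set(chars)))
    return re.sub(r'[]^\\-]', r'\\\g<0>', unique)


# Stand-ins for the real data (assumption: the real lists are much longer).
SAMPLE_NUMS = u'0123456789'
SAMPLE_PUNCTS = u'.,!?-]^\\--'  # includes duplicates and all four special chars

n = escape_for_char_class(SAMPLE_NUMS)
p = escape_for_char_class(SAMPLE_PUNCTS)

# [^...] is my reading of Perl's \P{N} (any character that is NOT a number).
rx = re.compile(u'([^{n}])([{p}])'.format(n=n, p=p))


def normalize(text):
    """Apply the equivalent of s/(\\P{N})(\\p{P})/$1 $2 /g to text."""
    return rx.sub(r'\1 \2 ', text)


print(normalize(u'a-b 3.5'))
```

Because the escaping runs over the whole string before it is formatted into the pattern, no character has to be located and moved to the front by hand.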

Wiktor Stribiżew