How to identify regex bad characters in a long list of characters?

Question

The goal is to port this Perl regex (from here) into Python:

$norm_text =~ s/(\P{N})(\p{P})/$1 $2 /g;

First, I copied the \p{P} and \p{N} character lists into readable text files and loaded them:

import re

import requests
from six import text_type

n_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Number.txt'
p_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Punctuation.txt'

NUMS = text_type(requests.get(n_url).content.decode('utf8'))
PUNCTS = text_type(requests.get(p_url).content.decode('utf8'))

But when I tried to compile the regex:

re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))

It throws this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
    p = _parse_sub(source, state, sub_verbose)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ~-- at position 217 (line 1, column 218)

Searching around, the problem seems to be dashes that are not escaped inside the character sets (see "Python regex bad character range").

It looks like there's a run of dash-like symbols in NUMS:

>>> NUMS[215:352]
'~----------------------------------------------------------------------------------------------------------------------------------------'

Then I moved the dash characters to the front of the string, but there are more bad characters:

>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> re.compile(u'([{n}])'.format(n=NUMS2))

[out]:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
    p = _parse_sub(source, state, sub_verbose)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ¬-- at position 502 (line 1, column 503)

So I moved more characters to the front:

>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> NUMS3 = re.escape(NUMS2[500:504]) + NUMS2[:500] + NUMS2[504:]
>>> re.compile(u'([{n}])'.format(n=NUMS3))

This seems to be an endless cycle of detecting what is a "bad character range" in a regex.

Is there a way to automatically identify all "bad characters" in a regex and shift them to the front?

All you need is to escape the `^`, `-`, `]` and ``\`` chars. Try `re.sub(r'[]^\\-]', r'\\\g<0>', NUMS)` and `re.sub(r'[]^\\-]', r'\\\g<0>', PUNCTS)` and then pass them to the `.format()` method. – Wiktor Stribiżew, Aug 14 '17 at 09:07
It seems like the original NUMS has duplicates too =) @WiktorStribiżew the regex you provided worked. – alvas, Aug 14 '17 at 09:22
First, `NUMS = ''.join(set(text_type(requests.get(n_url).content.decode('utf8'))))`, then `NUMS = re.sub(r'[]^\\-]', r'\\\g<0>', NUMS)`. And this works: `re.compile(u'([{n}])'.format(n=NUMS))` – alvas, Aug 14 '17 at 09:22
Yes, the lists you have seem to contain dupes, but that is something you can handle easily, e.g. with `set()`. – Wiktor Stribiżew, Aug 14 '17 at 09:26
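
The de-duplicate-then-escape steps from the comments above can be sketched as follows. This is only an illustration: `dedupe_and_escape` and the `sample` string are made-up stand-ins rather than the real downloaded lists, and the characters are sorted purely so the result is reproducible (a bare `set()` has no stable order).

```python
import re


def dedupe_and_escape(chars):
    """Drop duplicate characters, then escape the ones special inside [...]."""
    # sorted() only makes the output deterministic; order does not matter
    # inside a character class once the special characters are escaped.
    unique = u''.join(sorted(set(chars)))
    return re.sub(r'[]^\\-]', r'\\\g<0>', unique)


# Hypothetical sample with duplicates and all four troublesome characters.
sample = u'~--^^]]12\\1'
char_class = dedupe_and_escape(sample)
rx = re.compile(u'[{c}]'.format(c=char_class))

print(char_class)
```

Duplicates inside a character class are harmless to the regex engine, so the `set()` step mainly keeps the pattern shorter.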

score 5 · Accepted Answer · answered Aug 14 '17 at 09:25

The main point here is that you need to escape the ^, -, ] and \ chars inside a character class.

Use

NUMS = re.sub(r'[]^\\-]', r'\\\g<0>', NUMS)
PUNCTS = re.sub(r'[]^\\-]', r'\\\g<0>', PUNCTS)
rx = re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))

The r'[]^\\-]' pattern matches a single character that is ], ^, \ or -, and the r'\\\g<0>' replacement puts a \ in front of whatever was matched.
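
As a self-contained illustration of the escaping step, here is a minimal sketch that runs without downloading anything. The helper name `escape_for_char_class` and the short `nums` / `puncts` strings are assumptions standing in for the real NUMS and PUNCTS. Note also that Perl's `\P{N}` (capital P) means "not a number", so this sketch negates the first class with `[^...]`; that is my reading of the Perl pattern, not something the snippet above does.

```python
import re


def escape_for_char_class(chars):
    """Backslash-escape ], ^, \\ and - so they are literal inside [...]."""
    return re.sub(r'[]^\\-]', r'\\\g<0>', chars)


# Small stand-ins for the downloaded lists (assumed, not the real data).
nums = u'0123456789\u00b2\u00b3'
puncts = u'!-.,;:]^\\'

# [^...] mirrors Perl's \P{N}: any character that is NOT in the number list.
rx = re.compile(u'([^{n}])([{p}])'.format(
    n=escape_for_char_class(nums),
    p=escape_for_char_class(puncts)))

# Put spaces around punctuation that follows a non-number.
print(rx.sub(r'\1 \2 ', u'Hello, world-2.5'))
```

The `.` in `2.5` is left alone because the character before it is a number, which matches what the Perl substitution does.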

Wiktor Stribiżew

Awesome! Thanks Wiktor! – alvas Aug 14 '17 at 09:26