The goal is to port this Perl regex (from here) into Python:
$norm_text =~ s/(\P{N})(\p{P})/$1 $2 /g;
First I copied the \p{P} and \P{N} character arrays into readable text files, i.e.:
import requests
from six import text_type
n_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Number.txt'
p_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Punctuation.txt'
NUMS = text_type(requests.get(n_url).content.decode('utf8'))
PUNCTS = text_type(requests.get(p_url).content.decode('utf8'))
But when I tried to compile the regex:
re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))
It throws this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
return _compile(pattern, flags)
File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
itemsappend(_parse(source, state, verbose))
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
p = _parse_sub(source, state, sub_verbose)
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
itemsappend(_parse(source, state, verbose))
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ~-- at position 217 (line 1, column 218)
Looking around, the problem seems to be that the dashes are not escaped within the character sets (see "Python regex bad character range").
It looks like there's a run of dash-like symbols in:
>>> NUMS[215:352]
'~----------------------------------------------------------------------------------------------------------------------------------------'
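For reference, the same error can be reproduced with a much smaller pattern. This is only a minimal sketch with a made-up three-character set, not the actual contents of the downloaded file: inside `[...]`, the sequence `~--` is parsed as a range from `~` (U+007E) to `-` (U+002D), and because the end point sorts before the start point the range is rejected.

```python
import re

# Made-up stand-in for the problematic slice: '~' followed by two hyphens.
# Inside a character class this reads as the range '~' .. '-', which is
# reversed (0x7E > 0x2D), so compilation fails.
try:
    re.compile('[~--]')
    message = None
except re.error as exc:
    message = str(exc)

print(message)

# Escaping the hyphen turns it into a literal member of the class.
escaped = re.compile(r'[~\-]')
print(escaped.findall('a~b-c'))
```

So what matters is whether an unescaped `-` sits between two other characters, not which characters they are, which is why moving a few characters to the front only uncovers the next offending position.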
Then I moved the dash characters to the front of the string, but there are more bad characters:
>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> re.compile(u'([{n}])'.format(n=NUMS2))
[out]:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
return _compile(pattern, flags)
File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
itemsappend(_parse(source, state, verbose))
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
p = _parse_sub(source, state, sub_verbose)
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
itemsappend(_parse(source, state, verbose))
File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ¬-- at position 502 (line 1, column 503)
So I moved more characters to the front:
>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> NUMS3 = re.escape(NUMS2[500:504]) + NUMS2[:500] + NUMS2[504:]
>>> re.compile(u'([{n}])'.format(n=NUMS3))
This seems to be an endless cycle of detecting what is a "bad character range" in a regex.
Is there a way to automatically identify all "bad characters" in a regex and shift them to the front?
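One direction I could imagine, sketched here under assumptions rather than as a verified port: instead of hunting for bad positions, escape every character individually with `re.escape` before building the class, so that `-`, `]`, `^` and `\` can never act as metacharacters. The helper name `char_class` and the sample strings are hypothetical stand-ins for the downloaded `NUMS` and `PUNCTS`; since `\P{N}` in the Perl regex is the complement of `\p{N}`, the number class is negated here.

```python
import re

def char_class(chars, negate=False):
    """Hypothetical helper: build a character class matching any one of `chars`.

    Each character is passed through re.escape, so '-', ']', '^' and '\\'
    are always literals regardless of their position in the class.
    `chars` must be non-empty, otherwise the result is not a valid class.
    """
    body = ''.join(re.escape(c) for c in sorted(set(chars)))
    return '[{}{}]'.format('^' if negate else '', body)

# Small stand-ins for the real NUMS / PUNCTS strings (assumption).
NUMS_SAMPLE = '0123456789'
PUNCTS_SAMPLE = '.,-~]^\\'

# \P{N} means "not a number", hence negate=True for the first group.
regex = re.compile('({})({})'.format(
    char_class(NUMS_SAMPLE, negate=True),
    char_class(PUNCTS_SAMPLE),
))

result = regex.sub(r'\1 \2 ', 'abc,def 1.5')
print(result)
```

With these stand-ins, the comma after `c` is padded with spaces while the period after the digit `1` is left alone. Whether this behaves identically to the Perl version on the full character lists is something I have not checked.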