I am trying to replace special characters with an underscore in a given string (a badly formatted file path), but I cannot get it to work.

Here is the code:

import string, re
from unidecode import unidecode

punc = string.punctuation
punc = string.punctuation.replace(r'.','') # remove the dot from that string
pattern = re.compile(rf'[{punc}]')
# I also tried this as pattern; but it doesn't help:
# pattern = r'[' + punc + ']' 

test_string = r"\\some\random.path${}[]~(éè%&)ç\file.txt"
test_string = unidecode(test_string) # kick off accented letters

print(re.sub(pattern, '_', test_string))
>: \\some\random_path_______ee___c\file_txt

Since the dot is not in the pattern string, I cannot understand why it has been replaced. (I don't want it to be replaced.)

More strangely, if I shuffle the punctuation string:

from random import shuffle

punc = string.punctuation
punc = string.punctuation.replace(r'.','') # remove the dot

# shuffle punctuation:
punc = list(punc)
shuffle(punc)
punc = ''.join(punc)

pattern = re.compile(rf'[{punc}]')

it sometimes raises an error such as:

Traceback (most recent call last):

  File "/tmp/ipykernel_3429192/3014469097.py", line 1, in <cell line: 1>
    pattern = re.compile(rf'[' + punc +']')

  File "/usr/lib/python3.10/re.py", line 251, in compile
    return _compile(pattern, flags)

  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)

  File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)

  File "/usr/lib/python3.10/sre_parse.py", line 969, in parse
    raise source.error("unbalanced parenthesis")

error: unbalanced parenthesis

or, after some other shuffling which doesn't raise the above error, I got:

print(re.sub(pattern, '_', test_string))
>: \\some\random.path${}[]~(ee%&)c\file.txt

pattern 
>: re.compile(r'[|)_&{;=^\'-~]@,["><$:/!}*\#+(?%`]', re.UNICODE)

Here it doesn't seem to replace anything at all.

As mentioned in the first code block, I also tried skipping re.compile() and using pattern = r'[' + punc + ']' directly, but it doesn't help.

This may also be interesting:

for i in range(len(punc)):
    punc = punc[:-1]
    pattern = r'[' + punc + ']'
    print(f'{i}: pattern: {pattern} replaced_str: ',  re.sub(pattern, '_', test_string))
    
0: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{|}] replaced_str:  \\some\random_path_____~_ee___c\file_txt
1: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{|] replaced_str:  \\some\random_path__}__~_ee___c\file_txt
2: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{] replaced_str:  \\some\random_path__}__~_ee___c\file_txt
3: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`] replaced_str:  \\some\random_path_{}__~_ee___c\file_txt
4: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_] replaced_str:  \\some\random_path_{}__~_ee___c\file_txt
5: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^] replaced_str:  \\some\random_path_{}__~_ee___c\file_txt
6: pattern: [!"#$%&'()*+,-/:;<=>?@[\]] replaced_str:  \\some\random_path_{}__~_ee___c\file_txt
Traceback (most recent call last):

  File "/tmp/ipykernel_3401192/4037584865.py", line 4, in <cell line: 1>
    print(f'{i}: pattern: {pattern} replaced_str: ',  re.sub(pattern, '_', test_string))

  File "/usr/lib/python3.10/re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)

  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)

  File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)

  File "/usr/lib/python3.10/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)

  File "/usr/lib/python3.10/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,

  File "/usr/lib/python3.10/sre_parse.py", line 550, in _parse
    raise source.error("unterminated character set",

error: unterminated character set

In addition to the why, how could I achieve that goal properly?

Tested with Python 3.9, 3.10 and 3.11.

Ref: https://docs.python.org/3/library/string.html

This looks useful (not tested yet; I'll come back later to edit and share my results): Best way to strip punctuation from a string. However, it removes the special characters rather than replacing them, and it doesn't explain why my solution behaves so strangely.
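For the "how" part, here is a minimal sketch of a replace-based variant. It is only one possible approach and rests on two assumptions that are not stated in the question: that the backslash path separators should be kept along with the dot, and that the input has already been passed through unidecode (the test string below is the already-decoded one). The names `keep`, `result`, `table` and `translated` are illustrative.

```python
import re
import string

# Characters to leave untouched: the dot and (assumption) the backslash separator.
keep = '.\\'
punc = ''.join(ch for ch in string.punctuation if ch not in keep)

# re.escape() backslash-escapes regex metacharacters such as '-', '[', ']' and '^',
# so every character is taken literally inside the character set.
pattern = re.compile('[' + re.escape(punc) + ']')

# The test string from the question, after unidecode() has been applied.
test_string = r"\\some\random.path${}[]~(ee%&)c\file.txt"

result = pattern.sub('_', test_string)
print(result)

# Regex-free alternative: map each punctuation character to '_' with str.translate().
table = str.maketrans(punc, '_' * len(punc))
translated = test_string.translate(table)
print(translated)
```

If the backslashes should also become underscores, building `keep` from the dot alone should be enough, since `re.escape()` escapes the backslash as well.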

swiss_knight
  • Well, the first question is, have you tried printing `punc` to see what it is? – CrazyChucky Dec 07 '22 at 00:25
  • Yes, it's a string of all unwanted characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ (chances are that it will probably not get correctly printed here...obviously) See: https://docs.python.org/3/library/string.html#string.punctuation – swiss_knight Dec 07 '22 at 00:27
  • Does this answer your question? [Best way to strip punctuation from a string](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string) – Michael Ruth Dec 07 '22 at 01:04
  • That has good ways to accomplish the asker's goal, but doesn't explain what's going wrong with their current method. – CrazyChucky Dec 07 '22 at 01:57
  • About a dozen of those characters have special meaning in a regular expression! Some of them aren't special inside a character set - but the character set was ended by the `]` in `punc`, rather than the one at the end of the regex. – jasonharper Dec 07 '22 at 01:58
  • This sounds interesting, and my first investigations seem to point that way: the backslash followed by the closing square bracket ']' does not seem to be handled as I expected during parsing... I'll investigate the post shared by @MichaelRuth and come back later with an appropriate update. – swiss_knight Dec 07 '22 at 07:42
  • The dot is replaced due to the range `,-/`, which contains: `,` (U+002C, *Comma*) and `-` (U+002D, *Hyphen-Minus*) and `.` (U+002E, *Full Stop*) and `/` (U+002F, *Solidus*). – JosefZ Dec 07 '22 at 20:41
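The effects described in the comments above can be reproduced in isolation. The sketch below uses small hand-made patterns (not the asker's exact shuffled strings, which are random) to show the accidental range, the `\]` escape in the unshuffled string, and what happens when `]` closes the set early or never closes it. Variable names such as `range_matches` and `compile_error` are illustrative.

```python
import re
import string

# Punctuation with the dot removed, as in the question.
punc = string.punctuation.replace('.', '')

# 1. Inside [...], ',-/' is a range from U+002C to U+002F, which includes '.'.
range_matches = re.findall(r'[,-/]', ',-./')
print(range_matches)

# 2. In the unshuffled string, '\' sits right before ']', so the parser reads
#    '\]' as an escaped ']'. The set stays open, ']' is a member, '\' is not.
naive = re.compile('[' + punc + ']')
naive_result = naive.sub('_', r'a\b]c.d')
print(naive_result)

def compile_error(pattern):
    """Return the re.error message for a pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# 3. After shuffling, an unescaped ']' can close the set early, so a later ')'
#    is parsed as group syntax.
early_close = compile_error('[(])')
print(early_close)

# 4. If the only ']' left is the one escaped by '\', the set is never closed.
unterminated = compile_error('[@[\\]')
print(unterminated)
```

This also suggests why one shuffled pattern in the question compiled but matched nothing: it was parsed as two separate character sets with literal characters between them, a sequence that does not occur in the test string.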
