How to replace all characters except letters, numbers, forward and back slashes

Question

I want to parse text and keep only letters, digits, forward slashes and backslashes, replacing everything else with ''.

Is it possible to do this with a single regex pattern, rather than several patterns applied in a loop? I can't get the pattern below to leave the backslash and forward slash alone.

import re

line = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E  xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
line2 = line
RGX_PATTERN = r"[^\w]", "_"

# Apply each pattern in turn; this also strips / and \, which is the problem
for pattern in RGX_PATTERN:
    line = re.sub(pattern, '', line)
print("replace1: " + line)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

The code below, taken from another SO answer, was tested and found to be faster than regex, but it strips all special characters, including the / and \ that I want to preserve. Is there a way to adapt it to my use case while keeping its speed advantage over regex?

line2 = ''.join(e for e in line2 if e.isalnum())
print("replace2: " + line2)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

As an extra hurdle, the text I am parsing should be ASCII, so if possible any non-ASCII characters should also be replaced with ''.

Veedrac · Accepted Answer · 2014-05-08T04:11:01.580

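A fair bit faster, and it handles Unicode input correctly (non-ASCII characters are stripped):

import re

# Raw string: \\ is the regex escape for a single literal backslash.
# '_' needs no separate alternative; the negated class already matches it.
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def re_replace(string):
    return re.sub(full_pattern, '', string)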
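
As a quick check of the single-pattern approach, the sketch below applies the same kind of character class to a short sample string. The names `KEEP_PATTERN` and `strip_unwanted` and the sample text are made up for illustration. The pattern is written as a raw string so that `\\` in the source is the regex escape for one literal backslash.

```python
import re

# Anything that is not an ASCII letter, a digit, a backslash or a forward slash.
KEEP_PATTERN = re.compile(r'[^a-zA-Z0-9\\/]')

def strip_unwanted(text):
    """Remove every character matched by KEEP_PATTERN."""
    return KEEP_PATTERN.sub('', text)

sample = 'a/b\\c_d é!1'
print(strip_unwanted(sample))  # a/b\cd1
```

Because the class lists `a-zA-Z0-9` explicitly instead of using `\w`, the underscore and any non-ASCII letters are removed in the same pass.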

If you want it really fast, this is by far the best (but slightly obscure) method:

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    # Remove all non-ASCII characters. Heavily optimised.
    string = string.encode('ascii', errors='ignore').decode('ascii')

    # Remove unwanted ASCII characters
    return string.translate(ascii_code_point_filter)
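
The list works as a translation table because `str.translate` looks each character up by its code point: index `i` of the table holds either the replacement for `chr(i)` or `None` to delete it. A small sanity check, assuming Python 3 and restating the approach under the hypothetical names `table` and `translate_filter`, compares its output with a regex version:

```python
import re

def wanted(character):
    return character.isalnum() or character in '\\/'

# Indexed by code point: the character itself to keep it, None to delete it.
table = [chr(i) if wanted(chr(i)) else None for i in range(128)]

def translate_filter(string):
    # Drop non-ASCII characters first so every lookup stays inside the table.
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(table)

pattern = re.compile(r'[^a-zA-Z0-9\\/]')
busy = ''.join(chr(i) for i in range(512))

# Both approaches should keep exactly the same characters.
print(translate_filter(busy) == pattern.sub('', busy))  # True
```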

Timings:

SETUP="
busy = ''.join(chr(i) for i in range(512))

import re
full_pattern = re.compile('[^a-zA-Z0-9\\\/]|_')

def in_whitelist(character):
    return character.isalnum() or character in '\\/'

def re_replace(string):
    return re.sub(full_pattern, '', string)

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)
"

python -m timeit -s "$SETUP" "re_replace(busy)"
python -m timeit -s "$SETUP" "''.join(e for e in busy if in_whitelist(e))"
python -m timeit -s "$SETUP" "fast_replace(busy)"

Results (in the same order as the commands above):

10000 loops, best of 3: 63 usec per loop
10000 loops, best of 3: 135 usec per loop
100000 loops, best of 3: 4.98 usec per loop
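
The same comparison can also be run from inside Python with the `timeit` module instead of the shell. This is only a sketch: the loop count and the helper name `join_replace` are arbitrary choices, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

busy = ''.join(chr(i) for i in range(512))
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_code_point_filter = [chr(i) if wanted(chr(i)) else None for i in range(128)]

def re_replace(string):
    return full_pattern.sub('', string)

def join_replace(string):
    return ''.join(c for c in string if wanted(c))

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)

timings = {}
for func in (re_replace, join_replace, fast_replace):
    # Best of 3 runs of 1000 calls each, reported per call in microseconds.
    best = min(timeit.repeat(lambda: func(busy), number=1000, repeat=3))
    timings[func.__name__] = best / 1000 * 1e6
    print(f'{func.__name__}: {timings[func.__name__]:.2f} usec per call')
```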

produces exactly the same output as mine for all of these: ª²³µ¹º¼½¾ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàá — Master_Yoda, May 08 '14 at 03:53
@Master_Yoda; You are probably using Python 2. OP is using Python 3. — Veedrac, May 08 '14 at 03:55

Master_Yoda · Answer 2 · 2014-05-08T03:45:34.397

Why can't you do something like:

def in_whitelist(character):
    return character.isalnum() or character in ['\\','/']

line2 = ''.join(e for e in line2 if in_whitelist(e))

Edited as per suggestion to condense function.
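
Note that in Python 3 `str.isalnum()` is also true for non-ASCII letters and digits such as `ª` or `²`, so the function above keeps them. If the output must be ASCII only, one possible variation (a sketch, with the hypothetical names `ALLOWED` and `keep_allowed`) tests membership in an explicit set of allowed characters instead:

```python
import string

# ASCII letters, ASCII digits, backslash and forward slash.
ALLOWED = frozenset(string.ascii_letters + string.digits + '\\/')

def keep_allowed(text):
    """Keep only the characters listed in ALLOWED."""
    return ''.join(c for c in text if c in ALLOWED)

print(keep_allowed('1/Rª²e_p\\2'))  # 1/Rep\2
```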

edited May 08 '14 at 03:45

answered May 08 '14 at 03:32


I would personally change the last part to `character in ['\', '/']` for brevity. – Khaelex May 08 '14 at 03:39
Ok. This worked. Just had to escape the backslash `['\\', '/']` – lukik May 08 '14 at 03:43
Worked for me after escaping the string literal... Oh, and @Khaelid, agreed. – Master_Yoda May 08 '14 at 03:44
Fails for all of these: `ª²³µ¹º¼½¾ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàá` (etc.) – Veedrac May 08 '14 at 03:45
