Remove all special characters from text except for "\n" and "/"

Question

From this post I learned how to remove everything from a text except spaces and alphanumeric characters: Python: Strip everything but spaces and alphanumeric.

In this way:

re.sub(r'([^\s\w]|_)+', '', document)

I wanted basically to remove all the special characters.

However, now I want to do the same (i.e. remove all the special characters) but keep the following special characters: "\n" and "/".

How do I do this?

score 2 · Accepted Answer · answered May 23 '19 at 16:27


We can try rewriting your pattern without using the rigid character classes:

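import re

document = "Hello!@#$/ World!"
# Remove every run of characters that is not a space, newline, ASCII letter, digit, or "/"
output = re.sub(r'[^ \nA-Za-z0-9/]+', '', document)
print(output)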

Hello/ World

This says to remove any character which is not alphanumeric, space, newline, or forward slash.
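To make the behaviour on multi-line input concrete, here is a minimal sketch that wraps the same pattern in a function. The names `KEEP_PATTERN` and `strip_special` are hypothetical and only used for illustration. Note that only the literal space and "\n" are whitelisted, so other whitespace such as "\r" and "\t" is removed, and so is the underscore.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer above.
KEEP_PATTERN = re.compile(r'[^ \nA-Za-z0-9/]+')

def strip_special(text):
    """Drop everything except ASCII letters, digits, space, newline and '/'."""
    return KEEP_PATTERN.sub('', text)

sample = "path/to/file_1.txt\r\nsecond\tline!"
# "\r", "\t", "_", "." and "!" are all removed; "/" and "\n" survive.
print(repr(strip_special(sample)))
```

If tabs or carriage returns should be kept as well, they would have to be added to the character class explicitly.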


Tim Biegeleisen


The question is: Should `'Motörhead'` be an alphanumeric string or not? – Matthias May 23 '19 at 16:32
@Matthias What about `Radiöhead`? Never liked their music anyway though. – Tim Biegeleisen May 23 '19 at 16:32
Thanks, this looks good. By the way, I do not know if your comments are related to this, but I am wondering whether letters such as `ö` are considered special characters and hence will be removed. If so, this is a problem - I may write or find a separate question for this. – Outcast May 23 '19 at 16:45
@PoeteMaudit Yes, accented characters would be removed, but you could also include them in the character class. The problem here, as you have seen, is that saying `[^\w]` would automatically exclude `/`. – Tim Biegeleisen May 23 '19 at 16:47
Ok yes so probably I will have to write or find a separate question about this since I want to keep all accented letters. – Outcast May 23 '19 at 16:50
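One possible way to keep accented letters, sketched here under the assumption of Python 3 (where `\w` matches Unicode letters and digits in `str` patterns), is to put `\w` inside the negated class next to the extra characters. Since `\w` also matches "_", the underscore is removed through a separate alternative. The names `UNICODE_KEEP` and `strip_special_unicode` are hypothetical.

```python
import re

# Assumption: Python 3 str pattern, so \w covers Unicode letters and digits.
# \w also matches "_", hence the extra "|_" alternative to strip underscores.
UNICODE_KEEP = re.compile(r'[^\w \n/]|_')

def strip_special_unicode(text):
    """Keep Unicode letters/digits, space, newline and '/'; drop the rest."""
    return UNICODE_KEEP.sub('', text)

print(repr(strip_special_unicode("Motörhead!@# / snake_case\n")))
```

This is only one option; listing the wanted accented letters explicitly in the character class, as suggested in the comment above, gives tighter control over what is kept.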

Error - Syntactical Remorse · Answer 2 · 2019-05-23T16:34:45.580

1

I may be missing the full use case but you could do this without regex:

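# Note: the set below keeps the backslash ('\\'); put '/' in it to keep forward slashes as asked in the question.
s = "test\r\n\\ this\n"
# Keep alphanumeric characters plus the explicitly listed ones; '\r' is dropped.
s = ''.join(char for char in s if char.isalnum() or char in {'\\', '\n', ' '})
print(s)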

str.isalnum() is Unicode-aware, so letters such as ö count as alphanumeric and are kept.
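As a rough sketch of the same idea applied to the characters named in the question, the set of extra characters can be passed in as a parameter. The names `KEEP` and `keep_alnum_and` are hypothetical, and the default set ('/', newline, space) is an assumption based on the question title.

```python
# Hypothetical helper; KEEP holds the extra characters to preserve
# (assumed from the question: forward slash, newline, and space).
KEEP = frozenset({'/', '\n', ' '})

def keep_alnum_and(text, keep=KEEP):
    """Keep characters that are alphanumeric (Unicode-aware) or listed in keep."""
    return ''.join(ch for ch in text if ch.isalnum() or ch in keep)

print(repr(keep_alnum_and("Motörhead!@#/ test_1\r\n")))
```

Because '_'.isalnum() is False, underscores are removed here too, which matches the behaviour of the original `([^\s\w]|_)+` pattern.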

edited May 23 '19 at 16:34

answered May 23 '19 at 16:23
