From this post I learned how to remove everything from a text other than spaces and alphanumeric characters: Python: Strip everything but spaces and alphanumeric.

In this way:

re.sub(r'([^\s\w]|_)+', '', document)

I wanted basically to remove all the special characters.

However, now I want to do the same (i.e. to remove all the special characters) but without removing the following special characters:

  1. \n
  2. /

How do I do this?

Outcast

2 Answers

We can try rewriting your pattern without using the rigid character classes:

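import re

document = "Hello!@#$/ World!"
# Remove every run of characters that is not a space, newline, ASCII letter, digit, or "/"
output = re.sub(r'[^ \nA-Za-z0-9/]+', '', document)
print(output)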

Hello/ World

This removes any character that is not an ASCII letter or digit, a space, a newline, or a forward slash.
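
To check that newlines survive as well, here is a small sketch that applies the same character class to a multi-line string. The names `KEEP_PATTERN` and `strip_special` and the sample text are made up for illustration; they are not part of the answer above.

```python
import re

# Same character class as above, compiled once for reuse:
# keep spaces, newlines, ASCII letters, digits and forward slashes.
KEEP_PATTERN = re.compile(r'[^ \nA-Za-z0-9/]+')

def strip_special(text):
    """Delete every character that the negated class above matches."""
    return KEEP_PATTERN.sub('', text)

sample = "path/to/file!\nsecond_line #2\t(end)"
# repr() makes the surviving newline visible in the output
print(repr(strip_special(sample)))
```

Note that the underscore and the tab are removed too, since neither is listed in the class.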

Tim Biegeleisen
  • The question is: Should `'Motörhead'` be an alphanumeric string or not? – Matthias May 23 '19 at 16:32
  • @Matthias What about `Radiöhead`? Never liked their music anyway though. – Tim Biegeleisen May 23 '19 at 16:32
  • Thanks, this looks good. By the way, I do not know if your comments are related to this, but I am wondering whether letters such as `ö` are considered special characters and will therefore be removed. If so, this is a problem - I may write or find a separate question for this. – Outcast May 23 '19 at 16:45
  • @PoeteMaudit Yes, accented characters would be removed, but you could also include them in the character class. The problem here, as you have seen, is that saying `[^\w]` would automatically exclude `/`. – Tim Biegeleisen May 23 '19 at 16:47
  • Ok yes so probably I will have to write or find a separate question about this since I want to keep all accented letters. – Outcast May 23 '19 at 16:50
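
As a possible follow-up to the comments above, here is a minimal sketch of a variant that keeps accented letters. It assumes Python 3, where `\w` in a `str` pattern is Unicode-aware by default, and it puts `/` inside the negated class so that it is no longer removed. The names `UNICODE_PATTERN` and `strip_special_unicode` are hypothetical.

```python
import re

# [^\w \n/] matches anything that is not a word character, a space,
# a newline or a forward slash. The "|_" alternative also removes
# underscores, which \w would otherwise keep.
UNICODE_PATTERN = re.compile(r'[^\w \n/]|_')

def strip_special_unicode(text):
    """Strip special characters but keep Unicode letters and digits."""
    return UNICODE_PATTERN.sub('', text)

print(repr(strip_special_unicode("Motörhead/Radiöhead!\n#1_fan")))
```

This sketch assumes precomposed characters; if the input may contain decomposed forms (a base letter followed by a combining mark), normalizing it first with `unicodedata.normalize('NFC', text)` is probably worthwhile.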

I may be missing the full use case but you could do this without regex:

s = "test\r\n/ this\n"
# Keep alphanumeric characters plus the forward slash, newline, and space from the question
s = ''.join(char for char in s if char.isalnum() or char in {'/', '\n', ' '})
print(s)

str.isalnum() returns True for Unicode letters and digits, not just ASCII ones, so accented characters such as ö are kept.
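
The same idea can be wrapped in a small helper. This is only a sketch: the names `ALLOWED_EXTRA` and `keep_alnum_and` are made up, and it assumes the extra characters to keep are the forward slash and newline from the question, plus the space.

```python
# Characters to keep in addition to letters and digits (assumed from the question)
ALLOWED_EXTRA = frozenset({'/', '\n', ' '})

def keep_alnum_and(text, extra=ALLOWED_EXTRA):
    """Keep characters for which str.isalnum() is True, plus those in `extra`."""
    return ''.join(ch for ch in text if ch.isalnum() or ch in extra)

# repr() makes the surviving newline visible in the output
print(repr(keep_alnum_and("Motörhead/AC-DC!\r\nline_two")))
```

Unlike `\w`, `str.isalnum()` is False for the underscore, so no separate rule is needed to drop it.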