Python regex to replace any characters that are not either letters or white space

Question

I'm trying to replace every character that is neither a letter nor whitespace with ''. The code below worked in a number of test cases, but it fails on special, escaped characters such as \n.

import re
def process_text(text):
  text = text.lower()
  text = re.sub(pattern='[^A-z ^\s]',repl='',string=text).split(' ')
  return [word for word in text if word != '']

process_text('abc 123')
>>>> ['abc'] # this is what I wanted.

process_text('abc 123 \n')
>>>> ['abc', '\n'] # I don't want the new line character.

The below link informed me that \s was any whitespace. https://www.debuggex.com/cheatsheet/regex/python

However, the official documentation is more specific about \s: "Matches any whitespace character; this is equivalent to [ \t\n\r\f\v]." https://docs.python.org/3/howto/regex.html

So I see now that my code says ~find anything that is not a letter and not in the above set of special characters and replace it with ''.

So is there a way to retain spaces but remove the other whitespace characters, such as \n?

Also see [this post](https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret) about using `[A-z]` instead of `[A-Za-z]` — The fourth bird, Oct 23 '19 at 16:11

score 1 · Answer 1 · answered Oct 23 '19 at 16:09

To match all non-word and non-whitespace characters, you can use [^\w\s] - \w is any letter, number, or underscore, and \s is whitespace. If you'd prefer to only get letters, you can use [^a-zA-Z\s] instead.

(Also, when you're negating a character class, you only need to put ^ once, at the very start.)
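
As a quick illustration of the difference between the two classes, here is a minimal sketch. The sample string and variable names are made up for the example and are not from the question.

```python
import re

sample = "abc_123 def!\n"  # hypothetical input, not from the question

# [^\w\s] removes only characters that are neither word characters nor whitespace
no_punct = re.sub(r"[^\w\s]", "", sample)
print(repr(no_punct))      # 'abc_123 def\n'

# [^a-zA-Z\s] additionally removes digits and the underscore
letters_only = re.sub(r"[^a-zA-Z\s]", "", sample)
print(repr(letters_only))  # 'abc def\n'
```

Note that both versions keep the trailing newline, because \s covers it.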

answered Oct 23 '19 at 16:09

Nick Reed

When I redefined the second line of the function as `text = re.sub(pattern='[^\w\s] - \w',repl='',string=text).split(' ')`, `process_text('abc 123\n')` returned `['abc', '123\n']`. Maybe I misunderstood your answer. – Oct 23 '19 at 16:40
I think the misunderstanding is on my part - did you also specifically want to remove literal characters like `\n` instead of the newline? – Nick Reed Oct 23 '19 at 16:43

score 1 · Accepted Answer · edited Apr 13 '21 at 09:30

There are two things wrong with your pattern, so let's address them first.

A-z - This range covers every character in the ASCII table from A to z, which includes non-alphabetic characters such as ^ and _ that we don't want to match. Use [A-Z] for uppercase only, [A-Za-z] for both upper and lowercase, or turn on the case-insensitive flag (re.I).
^\s - ^ means negation only when it is the first character inside the character class; anywhere else in the class it is treated as a literal ^.

So your regex should be

 [^A-Za-z\s]
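
Both points can be checked directly. The following is a small sketch with illustrative variable names; the test string for the caret case is invented.

```python
import re

# Characters that sit between 'Z' (code 90) and 'a' (code 97) in ASCII
between = "".join(chr(c) for c in range(ord("Z") + 1, ord("a")))
print(between)  # [\]^_`

# [A-z] matches all of them, [A-Za-z] matches none
loose = re.findall(r"[A-z]", between)
strict = re.findall(r"[A-Za-z]", between)
print(loose)    # ['[', '\\', ']', '^', '_', '`']
print(strict)   # []

# A caret that is not first in the class is a literal caret, so it is kept
caret_kept = re.sub(r"[^a-z ^]", "", "a^b c1")
print(caret_kept)  # a^b c
```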

edited Apr 13 '21 at 09:30

Wiktor Stribiżew

answered Oct 23 '19 at 16:15

Code Maniac

I think this is close, but not quite there. For example, with `[^A-Za-z\s]`, `process_text('abc 123\n')` returns `['abc', '\n']` – Oct 23 '19 at 16:37

@mjake `\s` includes all kinds of whitespace characters. If you want only the space and not the newline etc., just change it to `[^A-Za-z ]` – Code Maniac Oct 23 '19 at 16:39
It worked perfect, thanks! I can't believe I was so close...yet so far from figuring it out. – Oct 23 '19 at 16:42
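
Putting the accepted answer and the comment above together, a minimal sketch of the corrected function might look like this. The switch to `split()` with no argument and the removal of the list comprehension are added simplifications, not something taken from the thread.

```python
import re

def process_text(text):
    """Lowercase text, keep only ASCII letters and spaces, return the words."""
    text = text.lower()
    # Drop everything that is not a letter or a literal space (so \n and \t go too)
    text = re.sub(r"[^A-Za-z ]", "", text)
    # split() with no argument splits on runs of whitespace and drops empty strings
    return text.split()

print(process_text("abc 123 \n"))  # ['abc']
```

One caveat: because the newline is deleted rather than treated as a separator, 'abc\ndef' becomes the single word 'abcdef'. If newlines should separate words instead, keeping `\s` in the class (`[^A-Za-z\s]`) and relying on `split()` to discard the whitespace would be an alternative.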
