I'm trying to replace every character that is neither a letter nor whitespace with ''. I thought the code below worked in a number of test cases, but it fails on escaped special characters such as '\n'.

import re
def process_text(text):
  text = text.lower()
  text = re.sub(pattern='[^A-z ^\s]',repl='',string=text).split(' ')
  return [word for word in text if word != '']

process_text('abc 123')
>>>> ['abc'] # this is what I wanted.

process_text('abc 123 \n')
>>>> ['abc', '\n'] # I don't want the new line character.

The link below told me that \s matches any whitespace. https://www.debuggex.com/cheatsheet/regex/python

The official documentation is more specific and says that \s "Matches any whitespace character; this is equivalent to [ \t\n\r\f\v]." https://docs.python.org/3/howto/regex.html

So I see now that my code says: find anything that is not a letter and not in the above set of whitespace characters, and replace it with ''.
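To see where the stray '\n' comes from, the intermediate values can be printed. This is only a small trace of the function above, using the same pattern and the second test input; the variable names `cleaned` and `tokens` are made up for the illustration.

```python
import re

text = 'abc 123 \n'.lower()

# '\n' belongs to \s, so the negated class does not match it and it survives.
cleaned = re.sub(r'[^A-z ^\s]', '', text)
print(repr(cleaned))   # 'abc  \n'

# split(' ') splits on the space character only, so '\n' becomes its own token.
tokens = cleaned.split(' ')
print(tokens)          # ['abc', '', '\n']
```

The empty string is dropped by the list comprehension, but '\n' is not empty, so it stays in the result.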

So is there a way to retain whitespace but remove the other special characters?

  • Also see [this post](https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret) about using `[A-z]` instead of `[A-Za-z]` – The fourth bird Oct 23 '19 at 16:11

2 Answers

To match every character that is neither a word character nor whitespace, you can use the character class [^\w\s], where \w matches any letter, digit, or underscore and \s matches any whitespace character. If you'd prefer to keep only letters, you can use [^a-zA-Z\s] instead.

(Also, when you're negating a character class, you only need to put ^ at the very start; a ^ anywhere else in the class is a literal caret.)
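As a rough illustration of the difference between the two classes, here is a minimal sketch. The sample string and the variable names `word_ws` and `letters_ws` are made up for this example.

```python
import re

sample = 'abc_123, def!\n'

# [^\w\s] removes punctuation but keeps letters, digits, underscores and whitespace.
word_ws = re.sub(r'[^\w\s]', '', sample)
print(repr(word_ws))      # 'abc_123 def\n'

# [^a-zA-Z\s] keeps only ASCII letters and whitespace.
letters_ws = re.sub(r'[^a-zA-Z\s]', '', sample)
print(repr(letters_ws))   # 'abc def\n'
```

Note that both versions keep the trailing newline, because '\n' is part of \s.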

Nick Reed
  • When I redefined the second line of the function as `text = re.sub(pattern='[^\w\s] - \w',repl='',string=text).split(' ')`, the output of `process_text('abc 123\n')` is `['abc', '123\n']`. Maybe I misunderstood your answer. –  Oct 23 '19 at 16:40
  • I think the misunderstanding is on my part - did you also specifically want to remove literal characters like `\n` instead of the newline? – Nick Reed Oct 23 '19 at 16:43

There are two things wrong in your pattern, so let's address them first:

  • A-z - This range covers every character in the ASCII table from A to z, which includes non-alphabetic characters ([, \, ], ^, _ and `) that we don't want to match. Use [A-Z] for uppercase only, [A-Za-z] for both upper and lowercase, or turn on the i (case-insensitive) flag, which is re.IGNORECASE in Python.
  • ^\s - ^ means negation only when it is the first character inside the character class; elsewhere it is treated as a literal ^.

So your regex should be

 [^A-Za-z\s]
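Both problems can be seen on a small input. The snippet below is only a sketch; the sample string and the names `original` and `corrected` are invented for the demonstration.

```python
import re

# Includes the characters that sit between 'Z' and 'a' in ASCII, plus a digit.
sample = 'a[b]c^d_e`f 1'

# Original class: A-z also covers [ ] ^ _ and `, so only the digit is removed.
original = re.sub(r'[^A-z ^\s]', '', sample)
print(repr(original))    # 'a[b]c^d_e`f '

# Corrected class: everything except ASCII letters and whitespace is removed.
corrected = re.sub(r'[^A-Za-z\s]', '', sample)
print(repr(corrected))   # 'abcdef '
```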
Wiktor Stribiżew
Code Maniac
  • I think this is close, but not quite there. For example, when using `[^A-Za-z\s]` above `process_text('abc 123\n')` the returned list is `['abc', '\n']` –  Oct 23 '19 at 16:37
  • @mjake `\s` includes all kinds of whitespace characters. If you want to keep only the space and not the newline etc., just change it to `[^A-Za-z ]` – Code Maniac Oct 23 '19 at 16:39
  • It worked perfect, thanks! I can't believe I was so close...yet so far from figuring it out. –  Oct 23 '19 at 16:42
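
Putting the thread together, the function could look roughly like the sketch below. It assumes the goal is a list of lowercase words made of ASCII letters only. `process_text_ws` is a hypothetical variant, not something proposed in the answers: it keeps \s in the class and relies on `str.split()` with no argument, which splits on any run of whitespace and never returns empty strings.

```python
import re

def process_text(text):
    """Keep only ASCII letters and spaces, then split into words."""
    text = text.lower()
    # Only a literal space is allowed, so '\n', '\t', etc. are removed as well.
    text = re.sub(r'[^A-Za-z ]', '', text)
    # split() with no argument drops empty strings, so no extra filtering is needed.
    return text.split()

def process_text_ws(text):
    """Hypothetical variant: keep all whitespace and let split() handle it."""
    return re.sub(r'[^A-Za-z\s]', '', text.lower()).split()

print(process_text('abc 123'))             # ['abc']
print(process_text('abc 123 \n'))          # ['abc']
print(process_text('abc\ndef 123'))        # ['abcdef']
print(process_text_ws('abc\ndef 123'))     # ['abc', 'def']
```

The last two lines show the trade-off: removing newlines outright glues together words that were separated only by a newline, while the variant treats the newline as a word separator.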