
Is there a best practice for removing unusual Unicode whitespace characters from strings in Python?

For example, if a string contains any of the Unicode characters in this table, I would like to remove them.

I was thinking of putting the characters into a list and looping over it with replace, but I'm sure there is a more Pythonic way of doing this.

datagoat
  • It'd be pythonic to use a regex. The `re` module is part of the Python standard library. – Jongware Jan 30 '20 at 20:01
  • Take a look at this post. It's not a duplicate, but it may help. https://stackoverflow.com/questions/37903317/is-there-a-python-constant-for-unicode-whitespace – Michael Bianconi Jan 30 '20 at 20:02
  • In my opinion, don't worry too much about writing "pythonic" code. I think the term can become counterproductive, because I don't think it's even all that well defined. Write the code in the most elegant way you can think of that reads well and then if one day you can think of a more elegant way then rewrite the code. If you want to remove only certain characters, then I think putting those characters in a list and removing them that way will be fine. As long as this is done in a simple function you create, then if you ever need to improve the performance then you can edit it. – LetEpsilonBeLessThanZero Jan 30 '20 at 20:03

2 Answers


You should be able to use this:

[''.join(letter for letter in word if not letter.isspace()) for word in word_list]

This works because, according to the docs for str.isspace:

Return True if there are only whitespace characters in the string and there is at least one character, False otherwise.

A character is whitespace if in the Unicode character database (see unicodedata), either its general category is Zs (“Separator, space”), or its bidirectional class is one of WS, B, or S.

The Unicode character list for category Zs shows which space characters this covers.
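
As a minimal sketch of the approach above, the comprehension can be wrapped in a small helper. The name `strip_unicode_whitespace` and the sample strings are made up for illustration. One caveat worth checking against your own data: zero-width characters such as U+200B are not classified as whitespace by str.isspace, so this leaves them in place.

```python
def strip_unicode_whitespace(text: str) -> str:
    """Return text without any character that str.isspace() accepts."""
    # Hypothetical helper name; the logic is the comprehension from the answer.
    return "".join(ch for ch in text if not ch.isspace())


# Sample data (assumed): no-break space, em space, ideographic space.
word_list = ["foo\u00a0bar", "\u2003baz\u3000", "plain"]
cleaned = [strip_unicode_whitespace(word) for word in word_list]
print(cleaned)  # ['foobar', 'baz', 'plain']
```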

Cory Kramer

Regex is your friend in cases like this. You can iterate over your list, applying a regex substitution to each string:

import re
# no ^ anchor, so whitespace is matched anywhere in the string
r = re.compile(r"\s+")

dirty_list = [...]
# iterate over dirty_list substituting
# any whitespace with an empty string
clean_list = [
  r.sub("", s)
  for s in dirty_list
]
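
A quick check of how the pattern behaves, as a sketch with made-up sample strings: for str patterns in Python 3, `\s` matches Unicode whitespace, and an unanchored pattern removes it anywhere in the string, whereas a pattern anchored with `^` only strips leading whitespace. The variable names here are illustrative only.

```python
import re

anywhere = re.compile(r"\s+")       # whitespace at any position
leading_only = re.compile(r"^\s+")  # whitespace at the start of the string only

# Sample string (assumed): em space, no-break space, ideographic space.
sample = "\u2003foo\u00a0bar\u3000"

print(repr(anywhere.sub("", sample)))      # 'foobar'
print(repr(leading_only.sub("", sample)))  # 'foo\xa0bar\u3000'
```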
Bitsplease
  • Nice! I just tested to make sure, and indeed `\s` matches all of the special spaces in the wikipedia list. – Jongware Jan 30 '20 at 23:49