In my chat app TalkTalkTalk, for usernames, I allowed alphanumeric characters only (A-Z, a-z, 0-9):
username = re.sub(r'\W+', '', username)  # strip non-word characters (keeps alphanumerics and underscore)
This is a bit too restrictive, because UTF-8 characters are useful in many cases (people whose names use an alphabet other than Latin, etc.). Now I would like to allow these useful UTF-8 characters from other alphabets, and even things like ❤ ☀ ☆ ☂ ☻ ♞ ☯ ☭ ☢. (Why not?)
But I don't want:
- any kind of whitespace or newline
- malicious characters that render as an empty, zero-width character: http://unicode-table.com/fr/200D/
- etc., and more generally any character that could make userA<malicious_char> look like realuserA.
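To make the concern concrete, here is a minimal sketch of the problem (not the regex I'm asking for). It assumes Python 3 and uses the standard `unicodedata` module; the helper name `suspicious_chars` is hypothetical, and flagging the "Other" (C*) and "Separator" (Z*) general categories is only my guess at a starting point. It would not catch look-alike letters from other alphabets.

```python
import unicodedata

def suspicious_chars(username):
    """Hypothetical helper: list characters whose Unicode general category
    is Other (C*: control, format, unassigned...) or Separator (Z*: spaces,
    line/paragraph separators). These are typically invisible or whitespace."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in username
        if unicodedata.category(ch)[0] in ("C", "Z")
    ]

real = "userA"
fake = "userA\u200d"  # same letters followed by U+200D ZERO WIDTH JOINER

print(real == fake)            # False: different strings that render alike
print(len(real), len(fake))    # 5 6
print(suspicious_chars(fake))  # [('U+200D', 'Cf')]
print(suspicious_chars("❤☀"))  # []: symbols like these are not flagged
```

So the two names compare as different, yet nothing visible distinguishes them, which is exactly the kind of username I want to reject.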
Which UTF-8 characters are printable (and suitable for use in a username)?
How can I filter them with a regex, for example in Python?
Note: This question is about finding a regex to filter them, so it's not a duplicate of some linked questions.