Is there a consensus about printable UTF8 characters (to be used for a username)?

Question

In my chat app TalkTalkTalk, for usernames, I allowed alphanumeric characters only (A-Z, a-z, 0-9):

username = re.sub(r'\W+', '', username)        # strip non-word characters (note: \w also keeps "_")

This is a bit too restrictive, because UTF-8 characters are useful in many cases (for example, for people whose names use an alphabet other than Latin). I would now like to allow these characters from other alphabets, and even symbols like ❤ ☀ ☆ ☂ ☻ ♞ ☯ ☭ ☢. (Why not?)

But I don't want:

- any kind of whitespace or newline (`\n`)
- malicious characters that look empty, such as zero-width characters: http://unicode-table.com/fr/200D/
- more generally, any character that could make userA<malicious_char> look like the real userA.

Which are the printable UTF8 characters? (to be used in a username)

How can I filter them with a regex, for example in Python?

Note: This question is about finding a regex to filter them, so it's not a duplicate of some linked questions.

I modified the question to make it clearer / less opinion-based. — Basj, Nov 22 '16 at 21:10
Possible duplicate of [What is the range of Unicode Printable Characters?](http://stackoverflow.com/questions/3770117/what-is-the-range-of-unicode-printable-characters) — Stop harming Monica, Nov 22 '16 at 21:19
@Goyo it's linked, but this question here is about finding a regex to filter such a string, and the duplicate is not about this. Thanks btw for the link. — Basj, Nov 22 '16 at 21:21
You could consider allowing *any* character, but require the [`Unidecode`](http://pypi.python.org/pypi/Unidecode) representation to be unique. — Mark Ransom, Nov 22 '16 at 21:26
I presume you have a database where you keep usernames? I know nothing about your app so I can't give you any specific advice, and it would be beyond the scope of your question. — Mark Ransom, Nov 22 '16 at 21:46
You should take a look at the Unicode properties for characters ([Character Properties](http://www.unicode.org/versions/Unicode9.0.0/ch04.pdf), chapter 4 of the Unicode standard). From there, you should decide which properties are acceptable for use in a user name, and then accept only those characters that match your chosen set of properties. — Jonathan Leffler, Nov 22 '16 at 22:00
Have a look at Unicode [General Categories](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category). You probably want to exclude Separator and Other, and only allow Letter, Mark, Number, Punctuation, and Symbol. — nwellnhof, Nov 22 '16 at 22:38
@nwellnhof Nice idea. Would be interesting to find a regex for this... — Basj, Nov 22 '16 at 22:50
It should be possible with the [`regex` module](http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties). The regex would look like `[\p{L}\p{M}\p{N}\p{P}\p{S}]+`. But there are other issues you should consider, like normalization, Zalgo text, or characters from different scripts with the same graphical representation (Cyrillic A vs. Latin A, for example). This really is a broad question... — nwellnhof, Nov 23 '16 at 11:32
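
On the normalization point raised in the last comment, here is a minimal sketch, assuming Python 3 and only the standard library, of a comparison key for checking that a new username is not just a differently encoded copy of an existing one. `username_key` is a hypothetical helper name for illustration, not something from the app.

```python
import unicodedata

def username_key(name):
    """Hypothetical comparison key: NFKC-normalize, then case-fold.

    NFKC folds compatibility variants (fullwidth letters, ligatures)
    and composes base letter + combining mark into one code point.
    """
    return unicodedata.normalize("NFKC", name).casefold()

# Fullwidth "Ａ" (U+FF21) collapses to the same key as plain "a".
print(username_key(u"\uff21lice") == username_key(u"alice"))
# "e" + combining acute accent matches precomposed "é".
print(username_key(u"Jose\u0301") == username_key(u"Jos\u00e9"))
```

Two names would then be treated as the same user when their keys are equal. This does not catch lookalikes across scripts: Cyrillic А (U+0410) and Latin A still produce different keys, so that case needs a separate check.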

Jose Ricardo Bustos M. · Answer 1 · 2016-11-22T21:59:02.750

2

You can use the re.UNICODE flag and Unicode escapes in the regex. Note that \u200b (zero-width space) is not technically defined as whitespace, so \s does not match it and it has to be listed explicitly.

Python 2.7 and 3

import re

username = u'My \u200bNick \u2602 \u263b \u200c '
# Characters to strip: everything \s matches, plus zero-width characters that \s misses
white_chars = [r'\s', u'\u200b', u'\u200c']     # etc.
regex_str = u'[' + u''.join(white_chars) + u']'
regex = re.compile(regex_str, flags=re.UNICODE)
regex.sub(u"", username)
print(regex.sub(u"", username))

You get:

u'MyNick\u2602\u263b'
MyNick☂☻
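
The denylist above has to name every invisible character by hand. An allowlist over Unicode General Categories, as suggested in the comments on the question, avoids that. The following is a minimal sketch using only the standard-library `unicodedata` module instead of a regex; `clean_username` and `ALLOWED_MAJOR_CLASSES` are hypothetical names chosen for illustration.

```python
import unicodedata

# Major General Category classes to keep:
# Letter, Mark, Number, Punctuation, Symbol.
# Everything in Separator (Z*) and Other (C*) is dropped.
ALLOWED_MAJOR_CLASSES = {"L", "M", "N", "P", "S"}

def clean_username(raw):
    """Keep only characters whose General Category is in an allowed class."""
    return u"".join(
        ch for ch in raw
        # category() returns a two-letter code such as "Lu" or "Cf";
        # its first letter is the major class.
        if unicodedata.category(ch)[0] in ALLOWED_MAJOR_CLASSES
    )

print(clean_username(u"My \u200bNick \u2602 \u263b \u200c\n"))
```

Spaces (Zs), newlines (Cc) and zero-width characters such as `\u200b`, `\u200c` and `\u200d` (all Cf) are removed, while ☂ and ☻ (So) are kept. One trade-off to be aware of: the zero-width joiner and non-joiner are legitimately used in some scripts and in emoji sequences, so dropping all of Cf may be too strict for some users.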

edited Nov 22 '16 at 21:59

answered Nov 22 '16 at 21:26

Jose Ricardo Bustos M.


Thanks. Would this allow ❤ ☀ ? I would like those characters to be allowed. – Basj Nov 22 '16 at 21:29
I had misunderstood, fixed in post – Jose Ricardo Bustos M. Nov 22 '16 at 21:40
What happens if `\n` (i.e. `\u000A`) or another non-printable character is in the input string? There seem to be a great many non-printable characters: http://stackoverflow.com/a/3770259/1422096 – Basj Nov 22 '16 at 21:44
For example `\u200c` (zero-width non-joiner), etc. I would use a list. – Jose Ricardo Bustos M. Nov 22 '16 at 21:59
