
In my chat app TalkTalkTalk, for usernames, I allowed alphanumeric characters only (A-Z, a-z, 0-9):

username = re.sub(r'\W+', '', username)        # strip every non-word character (keeps letters, digits and _)

This is a bit too restrictive, because Unicode characters are useful in many cases (for example, for people whose names are written in a non-Latin alphabet). Now I would like to allow these characters from other alphabets, and even things like ❤ ☀ ☆ ☂ ☻ ♞ ☯ ☭ ☢. (Why not?)

But I don't want:

  • any kind of whitespace or newline (\n)

  • malicious characters that render as nothing, such as the zero-width joiner: http://unicode-table.com/fr/200D/

  • and, more generally, any character that could make userA<malicious_char> look like the real userA.

Which are the printable UTF8 characters? (to be used in a username)

How to filter them with a regex, for example in Python?

Note: This question is about finding a regex to filter them, so it's not a duplicate of some linked questions.

Basj
  • I modified the question to make it clearer / less opinion-based. – Basj Nov 22 '16 at 21:10
  • Possible duplicate of [What is the range of Unicode Printable Characters?](http://stackoverflow.com/questions/3770117/what-is-the-range-of-unicode-printable-characters) – Stop harming Monica Nov 22 '16 at 21:19
  • @Goyo it's linked, but this question here is about finding a regex to filter such a string, and the duplicate is not about this. Thanks btw for the link. – Basj Nov 22 '16 at 21:21
  • You could consider allowing *any* character, but require the [`Unidecode`](http://pypi.python.org/pypi/Unidecode) representation to be unique. – Mark Ransom Nov 22 '16 at 21:26
  • @MarkRansom nice idea, do you see how to do that? – Basj Nov 22 '16 at 21:30
  • I presume you have a database where you keep usernames? I know nothing about your app so I can't give you any specific advice, and it would be beyond the scope of your question. – Mark Ransom Nov 22 '16 at 21:46
  • You should take a look at the Unicode properties for characters ([Character Properties](http://www.unicode.org/versions/Unicode9.0.0/ch04.pdf), chapter 4 of the Unicode standard). From there, you should decide which properties are acceptable for use in a user name, and then accept only those characters that match your chosen set of properties. – Jonathan Leffler Nov 22 '16 at 22:00
  • Have a look at Unicode [General Categories](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category). You probably want to exclude Separator and Other, and only allow Letter, Mark, Number, Punctuation, and Symbol. – nwellnhof Nov 22 '16 at 22:38
  • @nwellnhof Nice idea. Would be interesting to find a regex for this... – Basj Nov 22 '16 at 22:50
  • It should be possible with the [`regex` module](http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties). The regex would look like `[\p{L}\p{M}\p{N}\p{P}\p{S}]+`. But there are other issues you should consider like normalization, Zalgo text, or characters from different scripts with the same graphical representation (Cyrillic A vs. Latin A, for example). This really is a broad question... – nwellnhof Nov 23 '16 at 11:32
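
The general-category idea from the comments (allow Letter, Mark, Number, Punctuation and Symbol; reject Separator and Other) can be sketched with the standard library alone. The third-party `regex` module supports `\p{...}` property classes, but the sketch below swaps in `unicodedata.category` so that it needs no extra install. It assumes Python 3; the names `ALLOWED_MAJOR_CLASSES`, `clean_username` and `uniqueness_key` are made up for illustration. `uniqueness_key` is only a rough stand-in for the "unique representation" suggestion: it uses NFKC normalization plus case folding, not Unidecode, and it does not catch look-alikes from different scripts (Cyrillic A vs. Latin A).

```python
import unicodedata

# Major general-category classes to keep (assumption taken from the comments):
# L = Letter, M = Mark, N = Number, P = Punctuation, S = Symbol.
# Everything else (Z = Separator, C = Other: control, format, unassigned...) is dropped.
ALLOWED_MAJOR_CLASSES = {'L', 'M', 'N', 'P', 'S'}


def clean_username(name):
    """Return name without characters outside the allowed categories."""
    return ''.join(
        ch for ch in name
        if unicodedata.category(ch)[0] in ALLOWED_MAJOR_CLASSES
    )


def uniqueness_key(name):
    """Hypothetical helper: key to compare when checking for duplicate names.

    NFKC folds compatibility forms (e.g. fullwidth letters) and casefold()
    removes case differences. Cross-script homoglyphs are NOT handled.
    """
    return unicodedata.normalize('NFKC', clean_username(name)).casefold()


# clean_username('My \u200bNick \u2602 \u263b \u200c ')  ->  'MyNick\u2602\u263b'
# clean_username('userA\u200d')                           ->  'userA'
```

Filtering decides which characters may appear at all; comparing `uniqueness_key` values at registration time is a separate check that would reject a new name colliding with an existing one.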

1 Answer


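You can use the re.UNICODE flag and put Unicode characters directly in the regular expression. Note that \u200b (zero-width space) is not technically defined as whitespace, so \s does not match it and it has to be listed explicitly.

Python 2.7 and 3: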

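import re

username = u'My \u200bNick \u2602 \u263b \u200c '
# \s covers Unicode whitespace; zero-width characters must be listed by hand.
# A raw string avoids the invalid-escape warning for '\s' on recent Python 3.
white_chars = [r'\s', u'\u200b', u'\u200c']     # etc.
regex_str = u'[' + u''.join(white_chars) + u']'
regex = re.compile(regex_str, flags=re.UNICODE)
regex.sub(u'', username)           # in an interactive session this echoes the result
print(regex.sub(u'', username))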

You get:

u'MyNick\u2602\u263b'
MyNick☂☻
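
Listing zero-width characters by hand is easy to get incomplete (U+200D from the question is not in the list above, for instance). One possible generalization, shown here only as a sketch and assuming Python 3, is to build the character class from every code point whose general category is Cf (format). The names `format_chars` and `invisible` are made up for this example.

```python
import re
import sys
import unicodedata

# Every code point with general category Cf (format characters), which
# includes U+200B, U+200C and U+200D. The scan covers the whole code space
# once, so it is best done a single time at import.
format_chars = [
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Cf'
]

# For str patterns in Python 3, \s already matches Unicode whitespace.
invisible = re.compile(
    '[\\s' + ''.join(re.escape(ch) for ch in format_chars) + ']'
)

cleaned = invisible.sub('', 'My \u200bNick \u2602 \u263b \u200c ')
# cleaned == 'MyNick\u2602\u263b'
```

Be aware that Cf characters are not always malicious: the zero-width non-joiner is used in ordinary Persian text, and the zero-width joiner is part of many emoji sequences, so stripping them all is a trade-off.
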
Jose Ricardo Bustos M.