
Given a name string, I want to validate a few basic conditions:

- The characters belong to a recognized script/alphabet (Latin, Chinese, Arabic, etc.) and aren't, say, emojis.
- The string doesn't contain digits and has length < 40.

I know the latter can be accomplished via regex, but is there a Unicode way to accomplish the first? Are there any text processing libraries I can leverage?

guitard00d123

1 Answer


You should be able to check this using Unicode character classes in a regex.

[\p{P}\s\w]{40,}

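The most important part here is the \w character class in Unicode mode. The pattern combines three classes:

\p{P} matches any kind of punctuation character
\s matches any kind of whitespace character (equal to [\p{Z}\h\v])
\w matches any word character in any script (equal to [\p{L}\p{N}_])

Note that \w includes digits via \p{N}, so swap it for \p{L} if digits must be rejected. Also, {40,} matches runs of 40 or more characters; to require fewer than 40, use {1,39} between ^ and $ anchors.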

Live Demo

You may want to add more like \p{Sc} to match currency symbols, etc.
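To see which property class a given character falls into, the standard library's unicodedata module reports the Unicode general category that \p{L}, \p{N}, \p{P}, \p{Z} and \p{Sc} are built on. This is only a small sketch; the sample characters and the helper name general_category are my own picks for illustration, not part of the pattern above.

```python
import unicodedata

# Sample characters chosen for illustration (an assumption, not from the answer).
samples = ["a", "中", "ع", "7", "!", " ", "$", "😀"]


def general_category(ch):
    """Return the two-letter Unicode general category, e.g. 'Ll' or 'Nd'."""
    return unicodedata.category(ch)


for ch in samples:
    # First letter of the category: L = letter, N = number, P = punctuation,
    # Z = separator, S = symbol (Sc = currency, So = other symbol such as emoji).
    print(repr(ch), general_category(ch))
```

Letters from Latin, Chinese and Arabic all land in an L category, while the digit, the currency sign and the emoji do not, which is the distinction the question is after.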

To take advantage of this in Python, you need the third-party regex module (an alternative to the standard re module), which supports Unicode codepoint properties with the \p{} syntax.

import regex  # third-party module (pip install regex); supports \p{...} properties

# Runs of 40 or more punctuation, whitespace, or word characters in any script
pattern = r"[\p{P}\s\w]{40,}"

test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song!  \n"
            "Wow cool song! Wow cool song! Wow cool song! \n")

matches = regex.finditer(pattern, test_str, regex.UNICODE | regex.MULTILINE)

# The pattern has no capture groups, so only the full match is reported
for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(
        match_num, match.start(), match.end(), match.group()))
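If installing a third-party module is not an option, the two conditions from the question can probably be approximated with the standard library alone. The sketch below swaps the regex approach for unicodedata category checks. The function name is_valid_name and the ALLOWED_EXTRA separator set are assumptions of mine, so adjust them to whatever your names actually need.

```python
import unicodedata

MAX_LEN = 40  # the question asks for length < 40

# Hypothetical allow-list of separators that commonly appear in names.
ALLOWED_EXTRA = {" ", "-", "'", "."}


def is_valid_name(name):
    """Return True if name is non-empty, shorter than MAX_LEN code points,
    and made only of letters, combining marks and allowed separators."""
    if not name or len(name) >= MAX_LEN:
        return False
    for ch in name:
        if ch in ALLOWED_EXTRA:
            continue
        # 'L*' covers letters in any script, 'M*' covers combining marks
        # (accents, vowel signs). Digits ('N*') and emoji ('So') are rejected.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            return False
    return True


print(is_valid_name("Anna-Maria O'Neil"))
print(is_valid_name("guitard00d123"))
```

Keep in mind that len() counts code points, not user-perceived characters, so a name written with many combining marks reaches the limit sooner than it looks.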

PS: .NET Regex gives you some more options like \p{IsGreek}.
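Python's standard library has no direct script property, but for a rough idea of which script a letter belongs to, the first word of its Unicode character name is often good enough. This is a heuristic of my own, not an official script lookup, and the helper name script_hint is hypothetical.

```python
import unicodedata


def script_hint(ch):
    """Heuristic: for letters, return the first word of the Unicode
    character name (e.g. 'LATIN', 'GREEK'); return None otherwise."""
    if not unicodedata.category(ch).startswith("L"):
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


for ch in ["a", "α", "中", "ع", "😀"]:
    print(repr(ch), script_hint(ch))
```

Non-letters such as emoji or digits come back as None, so the same helper can double as a coarse "is this a letter from some script" check.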

wp78de
  • Sorry, I had some problems adding the Arabic and Chinese characters, but you can find some in the live demo. – wp78de Nov 19 '17 at 05:25