You should be able to check this using Unicode character classes in a regex:
[\p{P}\s\w]{40,}
The most important part here is the \w character class using Unicode mode:
\p{P} matches any kind of punctuation character
\s matches any kind of whitespace (invisible) character (roughly equal to [\p{Z}\h\v])
\w matches any word character in any script (roughly equal to [\p{L}\p{N}_])
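To make the breakdown concrete, here is a minimal sketch that reports which of the three classes match a single character. It assumes the third-party regex module is installed (pip install regex); the helper name matching_classes is made up for illustration.

```python
import regex  # third-party module, assumed installed: pip install regex

# The three classes used inside the character set of the pattern
CLASSES = [r"\p{P}", r"\s", r"\w"]

def matching_classes(ch):
    """Return the classes that match the single character ch."""
    return [c for c in CLASSES if regex.fullmatch(c, ch)]

for ch in ["!", " ", "я", "_", "$"]:
    print(repr(ch), matching_classes(ch))
```

Note that the underscore is both connector punctuation and a word character, while a currency symbol such as $ falls into none of the three classes.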
You may want to add more classes, like \p{Sc} to match currency symbols, etc.
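As a rough illustration of why the extra class can matter, the sketch below compares the original character set with one extended by \p{Sc}. The sample text is invented, and the regex module is again assumed to be installed.

```python
import regex  # third-party module, assumed installed: pip install regex

base = regex.compile(r"[\p{P}\s\w]{40,}")
with_currency = regex.compile(r"[\p{P}\s\w\p{Sc}]{40,}")

# Invented sample: the euro sign splits the text into runs shorter than 40
text = "Price is 100€ today only, price is 100€ today only!"

print(base.search(text))                  # no run of 40+ allowed characters
print(with_currency.fullmatch(text))      # the whole string is one run
```

With the base pattern, each currency symbol interrupts the run, so a long line can slip through unmatched.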
But to take advantage of this, you need to use the regex module (an alternative to the standard re module), which supports Unicode codepoint properties with the \p{} syntax.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import regex as re

regex = r"[\p{P}\s\w]{40,}"
test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song! \nWow cool song! Wow cool song! Wow cool song! \n")

matches = re.finditer(regex, test_str, re.UNICODE | re.MULTILINE)

# number the matches from 1
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    # report capture groups, if the pattern defines any (this one does not)
    for groupNum in range(1, len(match.groups()) + 1):
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
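For comparison, the sketch below shows what happens with the standard re module on current CPython 3 releases, which, as far as I know, reject the \p{} escape. The helper names stdlib_accepts and is_punctuation are hypothetical, and the unicodedata check is only a stdlib approximation of \p{P}, not a replacement for the regex module.

```python
import re
import unicodedata

def stdlib_accepts(pattern):
    """Return True if the standard re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(stdlib_accepts(r"[\p{P}\s\w]{40,}"))  # pattern from the answer
print(stdlib_accepts(r"[\s\w]{40,}"))       # same pattern without the property escape

def is_punctuation(ch):
    """Approximate the P property: general category starts with 'P'."""
    return unicodedata.category(ch).startswith("P")

print(is_punctuation("!"), is_punctuation("a"))
```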
PS: .NET Regex gives you some more options like \p{IsGreek}.
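The regex module offers something comparable through script properties. A minimal sketch, assuming regex is installed and using its \p{Script=Greek} form; the sample string is invented, and a script is not exactly the same thing as .NET's named blocks, so treat this as an approximation.

```python
import regex  # third-party module, assumed installed: pip install regex

# Runs of characters whose Unicode script is Greek
greek_run = regex.compile(r"\p{Script=Greek}+")

sample = "alpha αβγ beta δ"
print(greek_run.findall(sample))
```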