There are many, largely arbitrary, ways of grouping characters into "sets", but a common method is to block their underlying numeric encodings into ranges. ASCII, for example (and, conveniently, UTF-8 inherits these mappings and adds more), assigns the printable alphabetic, numeric and punctuation characters to the numbers 32 to 126.
32 - 47 Space and Punctuation
48 - 57 Numbers 0 - 9
58 - 64 Some more punctuation
65 - 90 Uppercase Alphabetic Characters
91 - 96 Some more punctuation
97 - 122 Lower case Alphabetic Characters
123 - 126 More punctuation (127 is DEL, a non-printing control character).
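Extensions to this basic set are also numbered consecutively and cover additional characters with diacritics used in other languages, entire new alphabets and, in some cases, the scripts of long-lost civilisations. Get accustomed to the Unicode code point blocks (and to how UTF-8 and UTF-16 encode them), decide which numeric blocks you want to slice on, and look at the raw values behind the strings you are working with. Note that outside the ASCII range a single character takes more than one byte in UTF-8, so for those characters it is the code point (ord(c) in Python) that falls into a block, not the individual bytes.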
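As a rough sketch of slicing on blocks beyond ASCII, the snippet below looks up a character's code point in a small table. The table name, the labels and the block_of helper are my own, chosen for illustration; the numeric ranges are those of the corresponding Unicode blocks, and you would extend the table with whichever blocks you care about.

```python
# Hypothetical block table: the labels are illustrative, the ranges are the
# inclusive Unicode code point ranges of the corresponding blocks.
UNICODE_BLOCKS = {
    "basic_latin": (0x0000, 0x007F),
    "latin_1_supplement": (0x0080, 0x00FF),
    "greek_and_coptic": (0x0370, 0x03FF),
    "cyrillic": (0x0400, 0x04FF),
}


def block_of(char):
    """Return the label of the first block containing char, or None."""
    code_point = ord(char)  # code point of a single character
    for name, (low, high) in UNICODE_BLOCKS.items():
        if low <= code_point <= high:
            return name
    return None


print(block_of("a"))       # basic_latin
print(block_of("\u00e9"))  # latin_1_supplement (e with acute accent)
print(block_of("\u03bb"))  # greek_and_coptic (lambda)
print(block_of("\u0416"))  # cyrillic (Zhe)

# One character, but two bytes once encoded as UTF-8.
print(len("\u00e9".encode("utf-8")))  # 2
```

Working on code points rather than bytes avoids having to reassemble multi-byte UTF-8 sequences yourself.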
To begin with, though, the printable ASCII range between 32 and 126 should give you a starting point to work with.
You can load the data as raw bytes and read the numeric value of each byte to get these numeric ranges, or load it into a string and use your_string.encode("utf-8") to get the equivalent sequence of numeric byte values (wrap it in list() if you want an actual list). For ASCII text these should fall into the ranges described here.
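A small illustration of both routes, using a made-up sample string; is_printable_ascii is a hypothetical helper name, not something from a library:

```python
text = "Mary had 2 lambs"

# Encoding a str gives a bytes object; iterating over it yields integers.
byte_values = list(text.encode("utf-8"))
print(byte_values[:4])  # [77, 97, 114, 121]

# Data that is already raw bytes behaves the same way.
raw = b"Mary had 2 lambs"
print(list(raw) == byte_values)  # True


def is_printable_ascii(data):
    """True if every byte lies in the printable ASCII range 32..126."""
    return all(32 <= b <= 126 for b in data)


print(is_printable_ascii(raw))                      # True
print(is_printable_ascii("caf\u00e9".encode("utf-8")))  # False
```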
You could curate a collection of byte ranges to fit your requirements, including or excluding specific characters as needed.
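One possible shape for such a curated collection is sketched below. BYTE_RANGES, in_ranges and tag_byte_ranges are names invented for this example, the ranges follow the ASCII table above, and skipping the space byte (32) is just one choice of what to exclude.

```python
# Hypothetical curated ranges (inclusive bounds), taken from the ASCII table.
BYTE_RANGES = {
    "digits": [(48, 57)],
    "upper_case_alpha": [(65, 90)],
    "lower_case_alpha": [(97, 122)],
    "full_alphabet": [(65, 90), (97, 122)],
}


def in_ranges(value, ranges):
    """True if value falls inside any of the inclusive (low, high) ranges."""
    return any(low <= value <= high for low, high in ranges)


def tag_byte_ranges(data, range_map, ignore=(32,)):
    """Return the names of all range sets that cover every non-ignored byte."""
    relevant = [b for b in data if b not in ignore]
    if not relevant:
        return []  # nothing to classify
    return [
        name
        for name, ranges in range_map.items()
        if all(in_ranges(b, ranges) for b in relevant)
    ]


print(tag_byte_ranges("mary had a little lamb".encode("utf-8"), BYTE_RANGES))
# ['lower_case_alpha', 'full_alphabet']
print(tag_byte_ranges(b"2024", BYTE_RANGES))
# ['digits']
```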
Alternatively, if you're happy to stay within string-land, just set out your character sets as collections of valid characters and match against those using a function.
Something like:
    char_d = {
        "upper_case_alpha": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
        "lower_case_alpha": "abcdefghijklmnopqrstuvwxyz",
        "full_alphabet": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
        "digits": "0123456789",
    }

    def tag_charsets(some_string, char_set_map):
        matching_tags = []
        for n, s in char_set_map.items():
            # Explicitly excluding the space character from the match
            if all(c in s for c in some_string if c != " "):
                matching_tags.append(n)
        return matching_tags

    tag_charsets("Mary had a little lamb", char_d)
Which should return a list containing:
    ['full_alphabet']
While
    tag_charsets("mary had a little lamb", char_d)
would return:
    ['lower_case_alpha', 'full_alphabet']
since every character of the all-lowercase string is found in both of those sets.
Here I'm only tagging a given character set if the entire string conforms (note that an empty or all-space string trivially matches every set, because all() of nothing is True). There might be more convenient methods for your use case, but you could easily edit the logic to do whatever you want. For example, you may want to choose the single most specific "set" that is valid for a given string.
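As one way of picking the most specific set, the sketch below treats "most specific" as "the matching set with the fewest characters". That definition, the CHAR_SETS table and the most_specific_charset name are assumptions made for this example; you may prefer an explicit priority order instead.

```python
import string

# Same sets as above, built from the standard library constants.
CHAR_SETS = {
    "upper_case_alpha": string.ascii_uppercase,
    "lower_case_alpha": string.ascii_lowercase,
    "full_alphabet": string.ascii_letters,
    "digits": string.digits,
}


def most_specific_charset(some_string, char_set_map):
    """Return the name of the smallest matching set, or None if none match."""
    chars = [c for c in some_string if c != " "]  # spaces are ignored
    if not chars:
        return None  # avoid the "empty string matches everything" case
    matches = [
        name
        for name, allowed in char_set_map.items()
        if all(c in allowed for c in chars)
    ]
    # Assumption: fewer allowed characters means a more specific set.
    return min(matches, key=lambda name: len(char_set_map[name]), default=None)


print(most_specific_charset("mary had a little lamb", CHAR_SETS))  # lower_case_alpha
print(most_specific_charset("Mary had a little lamb", CHAR_SETS))  # full_alphabet
print(most_specific_charset("Mary had 2 lambs", CHAR_SETS))        # None
```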