Identify characterset in Python

Question

I have to parse about 2 billion strings, one per line, and for each one identify and classify the range of characters it uses.

Input:

Example1: "456624"

Example2: "generalkenobi"

Example3: "Admiral!2"

Output

Example1: 10 char (0123456789)

Example2: 26 char (AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz)

Example3: 96 char (0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz !"#$%&'()*+,-./:;<=>?@[]^_`{|}~)

Is there an easy way to do this? So far I have tried various combinations of if with isdigit(), islower(), isupper() and string.punctuation.

Define your charsets as… *sets*… like `set('abcdef...')`, then test which set a line is a subset of? `if set(line) <= set('abcdef...'): ...` — deceze, Mar 02 '21 at 15:00
Perhaps `ord()` might help. This will convert a char to its numeric value in the ASCII table. Then, using min/max you can determine the range of chars used. (Obviously more logic will be involved, but a starting point.) — S3DEV, Mar 02 '21 at 15:02
@deceze Thank you for the answer. I am trying to implement it, how would you go about comparing with multiple sets? set(line) <= set(Numbers) and set(line) <= set(Lower): does not work for me — hexapods, Mar 02 '21 at 16:25
`chars10 = set('0123...'); chars26 = set('abc...')` `if set(line) <= chars10: print("It's 10 chars")` — Basically along this line. How *exactly* you'll structure this and what exactly you want to get out of it is up to you. — deceze, Mar 02 '21 at 16:28

Thomas Kimber · Answer 1 · 2021-03-02T15:24:34.797

There are many different, and fairly arbitrary, ways of grouping characters into "sets", but a common method is to block their underlying numeric encodings into ranges. ASCII, for example, places its printable alphabetic, numeric and punctuation characters at values 32 to 126 (127 is the non-printing DEL control character), and UTF-8 conveniently keeps those mappings while adding many more.

32 - 47  Space and punctuation
48 - 57  Digits 0 - 9
58 - 64  Some more punctuation
65 - 90  Uppercase alphabetic characters
91 - 96  Some more punctuation
97 - 122  Lowercase alphabetic characters
123 - 126  More punctuation (127 is the DEL control character)

Unicode extends this basic set with further consecutively numbered blocks covering accented characters used in other languages, entire additional alphabets and, in some cases, the scripts of long-lost civilisations. Get familiar with the Unicode block layout and with how UTF-8 (and, if you need it, UTF-16) encodes it, decide which numeric blocks you want to slice on, and look at the raw bytes used to encode the strings you are processing. Bear in mind that beyond ASCII a single character occupies more than one byte in UTF-8.

To begin with, though, the ASCII charset between 32 and 126 should give you a starting point to work with.

You can load the data as raw bytes and read the numeric value of each byte to get these numeric ranges, or load it into a string and use your_string.encode("utf-8") to get a bytes object whose elements are the equivalent numeric byte values. For ASCII text these will fall into the ranges described here.
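As a rough sketch of the second option (the helper name `byte_span` is made up for illustration): iterating over a `bytes` object in Python 3 yields integers, so `min` and `max` give the numeric span a string occupies.

```python
def byte_span(text):
    """Return (lowest, highest) UTF-8 byte value in text, or None if it is empty."""
    data = text.encode("utf-8")  # bytes object; iterating it yields ints 0-255
    if not data:
        return None
    return min(data), max(data)


print(byte_span("456624"))         # (50, 54)
print(byte_span("generalkenobi"))  # (97, 114)
print(byte_span("Admiral!2"))      # (33, 114)
```

Note that the span on its own only tells you the outer limits; a string such as "Admiral!2" covers several of the ranges in between, so you would still need to check which ranges are actually hit.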

You could curate a collection of byte ranges to fit your requirements, including or excluding specific characters as needed.
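One way such a curated collection might look is sketched below. The table `BYTE_RANGES`, its names and the function `classes_used` are all hypothetical; I am assuming the standard ASCII layout, with 91 to 96 treated as punctuation and anything outside 32 to 126 lumped together as "other".

```python
# Hypothetical range table; adjust the names and boundaries to your needs.
BYTE_RANGES = {
    "digits": [(48, 57)],
    "upper": [(65, 90)],
    "lower": [(97, 122)],
    "punctuation": [(32, 47), (58, 64), (91, 96), (123, 126)],
}


def classes_used(text):
    """Return the set of range names touched by the UTF-8 bytes of text."""
    found = set()
    for b in text.encode("utf-8"):
        for name, spans in BYTE_RANGES.items():
            if any(lo <= b <= hi for lo, hi in spans):
                found.add(name)
                break
        else:
            # No range matched: control character or part of a multi-byte sequence.
            found.add("other")
    return found


print(sorted(classes_used("Admiral!2")))  # ['digits', 'lower', 'punctuation', 'upper']
```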

Alternatively, if you're happy to stay within string-land, just set out your character sets as collections of valid characters and match against those using a function.

Something like:

# Map each tag to the characters it allows.
char_d = {
    "upper_case_alpha": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "lower_case_alpha": "abcdefghijklmnopqrstuvwxyz",
    "full_alphabet": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
    "digits": "0123456789",
}

def tag_charsets(some_string, char_set_map):
    """Return the names of every charset that contains all characters of some_string."""
    matching_tags = []
    for name, chars in char_set_map.items():
        # Spaces are explicitly excluded from the match.
        if all(c in chars for c in some_string if c != " "):
            matching_tags.append(name)
    return matching_tags

tag_charsets("Mary had a little lamb", char_d)

Which should return a list containing:

['full_alphabet']

While,

tag_charsets("mary had a little lamb", char_d)

Would return:

['lower_case_alpha', 'full_alphabet']

Since every character of the lowercase string is found in both of those sets.

Here I'm only tagging a given character set if the entire string conforms. There might be more convenient methods for your use-case but you could easily edit the logic to do whatever it is you want. It may be the case you want to choose the most-specific single "set" that is valid for a given string for example.
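If you do want the single most-specific set, one possible approach (a sketch only, combining the subset test suggested in the comments with the constants from the standard `string` module) is to order the candidate sets from smallest to largest and return the first one that covers the line. The list `CHARSETS`, its labels and the function `smallest_charset` are names I made up for illustration.

```python
import string

# Candidate charsets, ordered from most specific to most general (assumed ordering).
CHARSETS = [
    ("digits", set(string.digits)),
    ("lower", set(string.ascii_lowercase)),
    ("upper", set(string.ascii_uppercase)),
    ("letters", set(string.ascii_letters)),
    ("alphanumeric", set(string.ascii_letters + string.digits)),
    ("printable_ascii",
     set(string.ascii_letters + string.digits + string.punctuation + " ")),
]


def smallest_charset(line):
    """Return (name, size) of the first charset that contains every character of line."""
    used = set(line)
    for name, chars in CHARSETS:
        if used <= chars:  # subset test: every character used is allowed
            return name, len(chars)
    return "other", None  # contains something outside printable ASCII


print(smallest_charset("456624"))         # ('digits', 10)
print(smallest_charset("generalkenobi"))  # ('lower', 26)
print(smallest_charset("Admiral!2"))      # ('printable_ascii', 95)
```

The last size comes out as 95 rather than 96 because printable ASCII, including the space, has 95 characters. An empty line matches the first entry, so filter those out beforehand if that matters to you.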
