I have an array of words and I would like to remove all words that contain any unusual characters such as umlauts, accents, etc. (I know the accented characters could be normalized to plain ones instead, but I specifically want to drop the whole word).
My idea so far is to create an array of accepted characters (the letters a-z), since a whitelist is easier than a blacklist that accounts for every possible combination of letters and accents, and then to go through my array of words, checking whether each word contains any character outside the accepted set.
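To make that concrete, here is a minimal sketch of the whitelist (string.ascii_lowercase from the standard library is just the letters a-z; words is a made-up example input):

import string

# Whitelist of accepted characters: the 26 lowercase ASCII letters.
accepted_characters = set(string.ascii_lowercase)

words = ["hello", "café", "naïve", "world"]  # made-up example input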
I've found this article that describes the opposite, removing all words that do contain certain characters:
filtered_tokens = [w for w in tokens if all(ch not in w for ch in accepted_characters)]
Unfortunately, a simple negation doesn't make this work for my problem, as the example below shows.
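Using the made-up words from above: the straightforward negation turns "contains none of the accepted characters" into "contains at least one of them", which almost every word satisfies, so accented words still slip through:

tokens = ["hello", "café", "naïve", "world"]
accepted_characters = set("abcdefghijklmnopqrstuvwxyz")

# not all(ch not in w ...) is equivalent to any(ch in w ...):
attempt = [w for w in tokens if any(ch in w for ch in accepted_characters)]
print(attempt)  # ['hello', 'café', 'naïve', 'world'] -- nothing is removed

# Negating only the inner test, all(ch in w for ch in accepted_characters),
# keeps just the words containing every single accepted letter, i.e. nothing.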
I'm open to suggestions and entirely new approaches, but ideally I would like to get this working without additional packages, except perhaps nltk, which I am already using to extract my words from a text.
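For reference, my tokens come from something like this (a plain word_tokenize call on a made-up sentence; the punkt tokenizer data has to be downloaded once beforehand):

import nltk

# nltk.download("punkt")  # one-time download of the tokenizer model
tokens = nltk.word_tokenize("Some text with a naïve café in it.")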