
I have an array of words and I would like to remove all words that contain any unusual characters like umlauts, accents, etc. (I know there are ways to normalize them to regular characters instead, but I specifically want to remove them).

My idea so far is to create an array of accepted characters (letters a-z), because this is easier than making a blacklist accounting for all possible combinations of letters and accents, and go through my array of words checking if the word has any characters other than the accepted ones.

I've found this article that describes the opposite, removing all words that do contain certain characters:

filtered_tokens = [w for w in tokens if all(ch not in w for ch in accepted_characters)]

Unfortunately, a simple negation doesn't make this work for my problem.

I'm open to suggestions and new approaches altogether, but ideally I would like to get this to work with no additional packages besides maybe nltk, which I am using to extract my words from a text.

Andy
  • Welcome to Stack Overflow. I would recommend that you consider [this thread](https://stackoverflow.com/questions/4211209), and [this one](https://stackoverflow.com/questions/9227527), and [also this one](https://stackoverflow.com/questions/4162603). – bad_coder Dec 02 '20 at 12:25
  • @Vlad `str.strip` only removes characters from the left or right of the string, until it cannot remove any more. `"Hüllo"` would fail in your case. – Paul M. Dec 02 '20 at 12:26
  • This problem looks like it has already been asked and answered here: [https://stackoverflow.com/a/196392/2681662](https://stackoverflow.com/a/196392/2681662) – MSH Dec 02 '20 at 12:29
  • Removing "unusual" things is nearly always the wrong answer to the problem "I don't know what I'm looking at. I wish it would just go away". – tripleee Dec 02 '20 at 12:36

1 Answer


So, you have a list of words, and you want to remove entire words from that list. A word is removed if any of its characters is not in a set of "good" characters.

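from string import ascii_lowercase

# Whitelist of accepted characters: lowercase a-z only.
alphabet = frozenset(ascii_lowercase)

words = ["hüllo", "World", "youshouldseethis", "thisyoushouldn't"]


def is_valid_word(word):
    # True only if every character of the word is in the whitelist.
    return all(char in alphabet for char in word)

# Build a new list that keeps only the valid words.
filtered_words = list(filter(is_valid_word, words))
print(filtered_words)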

Output:

['youshouldseethis']

Notice that the word "World" also doesn't make it into the filtered_words list, since it contains an uppercase letter: I've defined the set of valid characters to be only lowercase letters (a-z). Likewise, "thisyoushouldn't" is dropped because of the apostrophe.
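If capitalized words such as "World" should be kept, one option is to widen the whitelist. This is only a sketch, assuming that plain uppercase A-Z should count as acceptable too; the name `is_valid_word_any_case` is made up for this illustration.

```python
from string import ascii_letters

# Assumption: both a-z and A-Z are acceptable characters.
alphabet = frozenset(ascii_letters)

words = ["hüllo", "World", "youshouldseethis", "thisyoushouldn't"]


def is_valid_word_any_case(word):
    # Hypothetical helper: the same whitelist check, against the wider set.
    return all(char in alphabet for char in word)


filtered_words = [word for word in words if is_valid_word_any_case(word)]
print(filtered_words)  # ['World', 'youshouldseethis']
```

The apostrophe is still not in the whitelist, so "thisyoushouldn't" stays out. Any further characters you want to allow would have to be added to the set explicitly.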

To be precise, we aren't removing words from any collection. Instead, we just create a new list, and only retain those words which we consider valid. is_valid_word is a function that acts as a predicate - returning True if all characters in the current word are valid, and False otherwise.
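As for why negating the snippet from the question doesn't help: `all(ch not in w for ch in accepted_characters)` loops over the accepted characters and asks whether none of them occurs in the word, so its negation only says "the word contains at least one accepted character". The loop has to run over the characters of the word instead. A rough sketch in list-comprehension form, reusing the `tokens` name from the question; the subset test is just another way to spell the same check:

```python
from string import ascii_lowercase

alphabet = frozenset(ascii_lowercase)
tokens = ["hüllo", "World", "youshouldseethis", "thisyoushouldn't"]

# Keep a token only if every one of its characters is in the whitelist.
filtered_tokens = [w for w in tokens if all(ch in alphabet for ch in w)]

# Equivalent check: the token's set of characters is a subset of the whitelist.
filtered_by_subset = [w for w in tokens if set(w) <= alphabet]

print(filtered_tokens)     # ['youshouldseethis']
print(filtered_by_subset)  # ['youshouldseethis']
print(len(tokens))         # 4, the original list is left untouched
```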

Paul M.