
I am doing some NLP with Python on YouTube comments I have downloaded, and I only want to process the English ones. So far I have experimented with different libraries (many of the ones discussed in this thread), and they work fine for longer strings, but many of them run into problems with the shorter, one- or two-word comments. My question is whether it would be hopelessly inefficient to download a dictionary of English words and check each of these short, problematic comments against it, obviously discarding the ones that don't match.

I can foresee problems with things such as misspellings, or words that appear in both English and a foreign language, but at present I am more concerned about speed, as I have about 68 million comments to process.
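
Roughly the check I have in mind is sketched below (english_words.txt is just a placeholder for whatever word list I end up downloading):

# load the downloaded word list into a set for fast membership tests
with open("english_words.txt", encoding="utf-8") as f:
    english_words = {line.strip().lower() for line in f}

def keep_comment(comment):
    # keep the comment only if every token is found in the word list
    tokens = comment.lower().split()
    return bool(tokens) and all(t.strip(".,!?") in english_words for t in tokens)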

    Depends on your memory etc, but for a process executing alone on a modern computer, this should run just fine. Whether it actually works well is another matter; many short phrases have a meaning in English as well as in some other language (for example, "jag tiger" is Swedish, but both words are valid English dictionary words). – tripleee Mar 15 '21 at 13:40
  • Excellent! Thanks so much for the help. Yeah, I do anticipate homographs being an issue, but I plan on opinion mining the results, so I feel that, given the size of the corpus and the relatively low chance of a homograph existing AND happening to have positive/negative connotations in English, the risk is relatively negligible. – Charlie Armstead Mar 15 '21 at 13:53
  • For what it's worth, an English spelling checker already does this, with a much more compact memory representation of the English dictionary (something like a few hundred kilobytes); see the sketch after these comments. – tripleee Mar 15 '21 at 14:22
  • Ah ok I will look into that – Charlie Armstead Mar 15 '21 at 15:38
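
A minimal sketch of that spelling-checker idea, using the third-party pyspellchecker package purely as an example (the package is not mentioned above; any checker that exposes its word list would do):

# pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker(language="en")

comment = "nice vid"
tokens = comment.lower().split()

# known() returns the subset of tokens found in the checker's built-in word list,
# unknown() the tokens it does not recognise
print(spell.known(tokens))
print(spell.unknown(tokens))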

1 Answer


Try using NLTK's corpora. NLTK is an external Python module that ships with multiple corpora for natural language processing. Specifically, what interests you is the following:

import nltk
nltk.download("words")         # one-time download of the word-list corpus
from nltk.corpus import words
eng_words = words.words("en")

words.words("en") is a list containing almost 236,000 English words. Converting it into a set will really speed up the lookups. You can then test each word against this corpus, and if it is present, it is an English dictionary word:

string = "I loved stack overflow so much. Mary had a little lamb"

# build the set once; membership tests against a set are O(1) on average
set_words = set(words.words("en"))

for word in string.split():
    # strip surrounding punctuation so tokens like "much." can still match
    if word.strip(".,!?") in set_words:
        print(word)

This prints each token that is found in the corpus on its own line. Keep in mind that the corpus holds dictionary headwords, so capitalized or inflected tokens (for example "loved") may not match unless you lowercase or lemmatize them first.
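
To tie this back to the original question of classifying whole short comments, here is a rough sketch that reuses the same set and keeps a comment if enough of its tokens match; the 0.5 threshold is an arbitrary starting point, not a tuned value:

import re
from nltk.corpus import words

eng_set = {w.lower() for w in words.words("en")}

def looks_english(comment, threshold=0.5):
    # treat the comment as English if at least `threshold` of its tokens are known
    tokens = re.findall(r"[a-z]+", comment.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in eng_set)
    return hits / len(tokens) >= threshold

print(looks_english("I loved stack overflow so much"))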

If it is a dictionary you are looking for (with proper definitions), I have used @Tushar's implementation. It is neatly made and is available for everyone. The format used is:

{WORD: {'MEANINGS':{} , 'ANTONYMS':[...] , 'SYNONYMS':[...]}}

and the 'MEANINGS' dict is arranged as

'MEANINGS': {sense_num_1: [TYPE_1, MEANING_1, CONTEXT_1, EXAMPLES], sense_num_2: [TYPE_2, MEANING_2, CONTEXT_2, EXAMPLES], ...}

The file is available here: https://www.dropbox.com/s/qjdgnf6npiqymgs/data.7z?dl=1 More details can be found here: English JSON Dictionary with word, word type and definition
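
As a rough sketch of how that JSON dictionary could be queried once the archive is extracted (the file name data.json and the key casing are assumptions; adjust them to match the actual file):

import json

with open("data.json", encoding="utf-8") as f:
    dictionary = json.load(f)

word = "lamb"
# key casing depends on the file, so try the obvious variants
entry = dictionary.get(word) or dictionary.get(word.upper()) or dictionary.get(word.capitalize())

if entry is not None:
    print("Synonyms:", entry.get("SYNONYMS", []))
    for sense, details in entry.get("MEANINGS", {}).items():
        # details is [TYPE, MEANING, CONTEXT, EXAMPLES] per the format above
        print(sense, details[0], details[1])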

  • This is very helpful! Why would converting it into a set speed up processing? – Charlie Armstead Mar 22 '21 at 14:39
  • A set is implemented with a hash-table data structure. For this reason, checking whether a specific value exists in a set takes O(1) (constant) time on average. To check whether a value is in a list, on the other hand, every element may have to be examined in a loop, which is O(n) time. The trade-off is that the elements of a set are not ordered or indexed. Note that sets aren't faster than lists in general: membership testing is faster for sets, and so is removing an element, but as long as you don't need those operations, lists are often faster (see the small timing sketch after these comments). – Pavlos Rousoglou Mar 22 '21 at 15:26
  • I see! In the end I used a different library (symspell) because I found that NLTK.words was too slow, but now I see it was because I was iterating through it as a list! – Charlie Armstead Mar 23 '21 at 12:05
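
A tiny timing sketch to make the set-versus-list difference concrete (absolute numbers will vary by machine; only the relative gap matters):

import timeit

data_list = list(range(100_000))
data_set = set(data_list)

# membership test for a value near the end of the collection
print(timeit.timeit("99_999 in data_list", globals=globals(), number=1_000))
print(timeit.timeit("99_999 in data_set", globals=globals(), number=1_000))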