
I have extracted text from an image using EasyOCR, and I found many spelling mistakes in the resulting list of words. I need to separate the words into meaningful and non-meaningful (misspelled) words and count each group.

I have this:

example = ["kaaggl","woryse","good","hey","otherwise","orrsy","taken","sometimes"]

I need like this:

meaning_full_words = ["good","hey","otherwise","taken","sometimes"]

Non-meaning_full_words  = ["kaaggl","woryse","orrsy"]

Please help me if there is any possible way to do it; I have a huge dataset.

Talha Tayyab
kavya
  • Does this answer your question? [How to check if a word is an English word with Python?](https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python) – ygorg Feb 22 '22 at 09:43

3 Answers


You want to iterate through the list of words and check each one against an English dictionary. A library such as PyEnchant has the functionality you need.
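For instance, a minimal sketch with PyEnchant (assuming the en_US Enchant dictionary is installed on your system) could look like this:

import enchant

# Load the en_US dictionary (must be installed alongside PyEnchant)
d = enchant.Dict("en_US")

example = ["kaaggl", "woryse", "good", "hey", "otherwise", "orrsy", "taken", "sometimes"]
# d.check(word) returns True if the word is in the dictionary
meaning_full_words = [w for w in example if d.check(w)]
non_meaning_full_words = [w for w in example if not d.check(w)]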

ppeko

If you want to check whether words are meaningful or complete, you can use the language_check module:

Some examples:

  • To check for mistakes:

import language_check

tool = language_check.LanguageTool('en-US')
text = u'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
matches = tool.check(text)

  • To correct them:

language_check.correct(text, matches)

You can use a for loop to iterate over the list and sort the words into correct and incorrect groups, as in the sketch below.
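A rough sketch of that loop (assuming LanguageTool's en-US spelling rule, MORFOLOGIK_RULE_EN_US, is what flags unknown words) could look like this:

import language_check

tool = language_check.LanguageTool('en-US')
example = ["kaaggl", "woryse", "good", "hey", "otherwise", "orrsy", "taken", "sometimes"]

meaning_full_words = []
non_meaning_full_words = []
for word in example:
    # Keep only spelling matches; other rules (e.g. sentence
    # capitalization) would otherwise flag correct lowercase words too
    matches = [m for m in tool.check(word) if m.ruleId == 'MORFOLOGIK_RULE_EN_US']
    if matches:
        non_meaning_full_words.append(word)
    else:
        meaning_full_words.append(word)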

Alternatively, you can compare the words against an English dictionary, for example like this:
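A sketch using the NLTK words corpus as the dictionary (this assumes NLTK is installed and the word list has been downloaded):

import nltk
nltk.download('words')  # one-time download of the English word list
from nltk.corpus import words

english_words = set(w.lower() for w in words.words())

example = ["kaaggl", "woryse", "good", "hey", "otherwise", "orrsy", "taken", "sometimes"]
meaning_full_words = [w for w in example if w.lower() in english_words]
non_meaning_full_words = [w for w in example if w.lower() not in english_words]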

Faraaz Kurawle

I see you tagged your question with spacy, so let me offer a solution that uses spaCy.

You can use the .is_oov attribute of spaCy's Token class to check whether a token is out of the spaCy vocabulary.

That results in the following code, which works on your example:

import spacy

# Load the large English pipeline (it includes word vectors)
nlp = spacy.load('en_core_web_lg')

example = ["kaaggl", "woryse", "good", "hey", "otherwise", "orrsy", "taken", "sometimes"]
# Tokenize all the words together as one sentence
doc = nlp(' '.join(example))

meaning_full_words = []
non_meaning_full_words = []
for token in doc:
    # is_oov is True when the token is out of spaCy's vocabulary
    if token.is_oov:
        non_meaning_full_words.append(token.text)
    else:
        meaning_full_words.append(token.text)

If you now check the results, you have your two lists. You could also do this word by word, but since you are only interested in 'real' words, you can join them and tokenize them all together as one sentence. It would also be more efficient to drop the parts of the spaCy pipeline that you are not using, as sketched below.
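For example (a sketch; the component names assume the en_core_web_lg pipeline and may differ across spaCy versions), you could load the model with the unused components disabled:

import spacy

# Only the vocabulary lookup is needed here, so skip tagging,
# parsing and named-entity recognition
nlp = spacy.load('en_core_web_lg', disable=['tagger', 'parser', 'ner'])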

Robert