
Consider the following for spell-correction:

from autocorrect import spell
import re

WORD = re.compile(r'\w+')
def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))      
    return sptext    

print(spell_correct(text)) 

Here is the output for the above piece of code:

[screenshot of the output]

How can I stop displaying the output in a Jupyter notebook? In particular, if we have a large number of text documents, there will be a lot of output.

My second question is: how can I improve the speed and accuracy of the code (check the word "veri" in the output, for example) when applying it to a large dataset? Is there a better way to do this? I would appreciate your response and (alternative) solutions with better speed.

  • Apparently `autocorrect.spell` is deprecated. Presumably if you use `autocorrect.Speller` instead, you won't get those messages any more. – khelwood Jul 09 '20 at 07:52

1 Answer


As @khelwood said in the comments, you should use autocorrect.Speller:

from autocorrect import Speller
import re


spell = Speller(lang="en")
WORD = re.compile(r'\w+')
def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))      
    return sptext    

print(spell_correct(text)) 

#Output
#['hi welcome to spelling', 'this is just an example but consider a veri big corpus']
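
Regarding the first question, a simple way to keep a Jupyter notebook from filling up with output is to assign the result instead of printing it; the %%capture cell magic can also hide any remaining messages. A minimal sketch, reusing the names from the snippet above:

# assigning instead of printing keeps the notebook from echoing the corrected text
corrected = spell_correct(text)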

As an alternative, you could use a list comprehension, which may speed things up a little, and you could switch to the pyspellchecker library, which corrects the word 'veri' in this case:

from spellchecker import SpellChecker
import re

WORD = re.compile(r'\w+')
spell = SpellChecker()

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = [' '.join([spell.correction(w).lower() for w in reTokenize(doc)]) for doc in text]
    return sptext    

print(spell_correct(text)) 

Output:

['hi welcome to spelling', 'this is just an example but consider a very big corpus']
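
If speed on a large corpus is a concern and many words repeat, caching the per-word lookups should help. A minimal sketch, assuming spell and reTokenize are defined as above and that spell.correction returns a string as it does here:

from functools import lru_cache

@lru_cache(maxsize=None)
def cached_correction(word):
    # the expensive lookup runs once per distinct word; repeated words hit the cache
    return spell.correction(word).lower()

def spell_correct_cached(text):
    return [' '.join(cached_correction(w) for w in reTokenize(doc)) for doc in text]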
  • Thanks MrNobody33, your first one works well, but the second one requires the indexer module. It seems "indexer" is not supported by Python 3.7; https://stackoverflow.com/questions/57602566/pip-install-indexer-error-in-python-3-7-in-windows-10. – Sam S. Jul 09 '20 at 08:25
  • Did you install it with `pip install spellchecker`? I get the same "indexer" error that way. Try `pip install pyspellchecker`, as it's recommended in the docs. – MrNobody33 Jul 09 '20 at 08:28
  • Thanks, it works now, but the second method is slower. – Sam S. Jul 09 '20 at 08:56
  • Maybe try with `' '.join(map(lambda x: spell.correction(x).lower(),reTokenize(doc)))` – MrNobody33 Jul 09 '20 at 09:34
  • Or maybe it's the library; you could try the first option but using a list comprehension. – MrNobody33 Jul 09 '20 at 09:35
  • Thanks. For text = ["This is jsut an exapmle, but cosnider a veri big coprus."]*50000, it takes around 25 and 125 seconds for methods 1 and 2 on my laptop, respectively. The list comprehension does not make a significant difference. How can this be done using parallel packages or the "ray" module? – Sam S. Jul 09 '20 at 09:50
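
For the parallelization question in the last comment, here is a minimal sketch using the standard-library multiprocessing module; ray would follow the same pattern of splitting documents across workers, and the chunksize below is just an illustrative choice:

from multiprocessing import Pool
from spellchecker import SpellChecker
import re

WORD = re.compile(r'\w+')
spell = SpellChecker()

def correct_doc(doc):
    # documents are independent, so each worker can correct its own chunk
    return ' '.join(spell.correction(w).lower() for w in WORD.findall(doc))

if __name__ == '__main__':
    text = ["This is jsut an exapmle, but cosnider a veri big coprus."] * 50000
    with Pool() as pool:
        corrected = pool.map(correct_doc, text, chunksize=500)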