
I have this spell checker which I have written:

import operator

class Corrector(object):

    def __init__(self,possibilities):
        self.possibilities = possibilities

    def similar(self, w1, w2):
        # truncate both words to the length of the shorter one
        w1 = w1[:len(w2)]
        w2 = w2[:len(w1)]
        # fraction of positions where the characters match
        return sum([1 if i == j else 0 for i, j in zip(w1, w2)]) / float(len(w1))

    def correct(self, w):
        corrections = {}
        for c in self.possibilities:
            # similarity weighted by the word's relative frequency in the corpus
            probability = self.similar(w, c) * self.possibilities[c] / sum(self.possibilities.values())
            corrections[c] = probability
        return max(corrections.iteritems(), key=operator.itemgetter(1))[0]

Here possibilities is a dictionary like:

{word1: value1}, where the value is the number of times the word appeared in the corpus.

The similar function returns a similarity score between 0 and 1 for the two words w1 and w2.

In the correct function, you can see that the code loops through all the possible words and computes, for each of them, the probability of it being the correct spelling of w.
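
For example, with a toy corpus (the words and counts here are made up), it is used like this:

possibilities = {'hello': 10, 'help': 7, 'hell': 3}   # made-up corpus counts
corrector = Corrector(possibilities)
print corrector.correct('helo')                       # -> 'hello'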

Can I speed up my code by somehow removing the loop?

Now, I know there might be no answer to this question; if I can't, just tell me so!

tenstar
  • note: the implementation of `similar` is quite naive; it doesn't handle character deletion well (e.g. "iplementation") – Karoly Horvath Jun 30 '13 at 09:07
  • If I were you, I'd actually use the answer Inbar Rose supplied here: http://stackoverflow.com/a/17388505/1971805 to check similarity – TerryA Jun 30 '13 at 09:09
  • Any performance benefits, @Haidro? – tenstar Jun 30 '13 at 09:12
  • note2: http://norvig.com/spell-correct.html – Karoly Horvath Jun 30 '13 at 09:13
  • @tenstar Perhaps, but the one Inbar supplied is **much** more accurate – TerryA Jun 30 '13 at 09:13
  • There is a C "Levenshtein" module on PyPI: https://pypi.python.org/pypi/python-Levenshtein/. It is quite a bit faster than difflib and provides the same utilities (quickratio, etc.) plus some others (levenshtein, hamming, ...). If you want speed, it is probably your best bet. – michaelmeyer Jun 30 '13 at 11:57

3 Answers


Here you go....

from operator import itemgetter
from difflib import SequenceMatcher

class Corrector(object):

    def __init__(self, possibilities):
        self.possibilities = possibilities
        self.sums = sum(self.possibilities.values())

    def correct(self, word):
        corrections = {}
        # keep `word` as the cached second sequence; only the first sequence changes
        sm = SequenceMatcher(None, '', word)
        for w, t in self.possibilities.iteritems():
            sm.set_seq1(w)    # set_seq1 is cheap; the second sequence stays cached
            corrections[w] = sm.ratio() * t / self.sums
        return max(corrections.iteritems(), key=itemgetter(1))[0]
Inbar Rose
  • Yes, that might be faster, but is there any way to just bypass the for loop? – tenstar Jun 30 '13 at 10:21
  • It *is* faster, check for yourself. But no, there is no way to check every word in a list without looping over that list. There is nothing wrong with looping - however, if you want, you can add break conditions: if the ratio is 100% you can immediately return that word, and so on (see the sketch after these comments). No matter what you do, somewhere in the code you will have to go word by word, whether iterating in real time or preparing a dictionary beforehand... You will need to loop. – Inbar Rose Jun 30 '13 at 11:42
  • But then if my word corpus has a trillion entries, it'll take me years (not really) to find my correct spelling! – tenstar Jul 05 '13 at 11:41
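
A rough sketch of that early-exit idea, assuming the SequenceMatcher-based Corrector from the answer above:

    def correct(self, word):
        best_word, best_score = None, -1.0
        sm = SequenceMatcher(None, '', word)
        for w, t in self.possibilities.iteritems():
            sm.set_seq1(w)
            ratio = sm.ratio()
            if ratio == 1.0:
                return w                       # exact match: no need to keep looping
            score = ratio * t / self.sums
            if score > best_score:
                best_word, best_score = w, score
        return best_word

This still visits the words one by one, but it skips building the full corrections dict and stops as soon as an exact match turns up.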

You could simply cache the results of correct, so that on the next call with the same word you already know the answer without doing any computation.
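
A minimal sketch of that idea, assuming the Corrector class from the question (the CachedCorrector name is just for illustration):

class CachedCorrector(Corrector):

    def __init__(self, possibilities):
        Corrector.__init__(self, possibilities)
        self._cache = {}                       # word -> previously computed correction

    def correct(self, w):
        if w not in self._cache:
            self._cache[w] = Corrector.correct(self, w)
        return self._cache[w]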

Karoly Horvath
  • Why the downvote? It's a big improvement performance-wise, with hardly any coding. – Karoly Horvath Jun 30 '13 at 09:34
  • I also don't get the downvote, but you can do one better here: use decorators to memoize the function. You can just copy the memoize decorator from any of the many examples online and then it's just a matter of adding `@memoize` to the function definition (see the sketch after these comments). – Livius Jun 30 '13 at 09:37
  • Well, I didn't mention how to *implement* the caching; what you mentioned is one way. You probably want to do it so it works at the application level, reloading the last state at startup. – Karoly Horvath Jun 30 '13 at 09:39
  • Saving the cache state is definitely a good idea here; I wasn't even thinking about multiple runs. I also am not sure why you would implement a spell checker and not use one of the ones already available. Don't reinvent the wheel and all that. – Livius Jun 30 '13 at 09:42
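
A generic memoize decorator of the kind mentioned above (a sketch only; any of the usual recipes would do):

import functools

def memoize(func):
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:                  # args must be hashable
            cache[args] = func(*args)
        return cache[args]
    return wrapper

With this in place, decorating correct with @memoize gives the same behaviour as the explicit cache above.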

You typically don't want to check the submitted token against all the tokens in your corpus. The "classic" way to reduce the necessary computations (and thus the number of iterations of your for loop) is to maintain an index of all the (tri-)grams present in your document collection. Basically, you keep a list of all the tokens of your collection on one side and, on the other side, a hash table whose keys are the grams and whose values are the indexes of the tokens in the list. This can be made persistent with a DBM-like database.

Then, when it comes to checking the spelling of a word, you split it into grams, look up all the tokens in your collection that share grams with it, sort them by gram similarity with the submitted token, and only then perform your distance computations.
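
A minimal in-memory sketch of such an index (the GramIndex name, the '$' padding, and the candidate cut-off are arbitrary choices):

from collections import defaultdict

def grams(token, n=3):
    token = '$' + token + '$'                  # pad so short tokens still yield grams
    return set(token[i:i+n] for i in range(len(token) - n + 1))

class GramIndex(object):

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.index = defaultdict(set)          # gram -> indexes of tokens containing it
        for i, tok in enumerate(self.tokens):
            for g in grams(tok):
                self.index[g].add(i)

    def candidates(self, word, limit=150):
        # count how many grams each indexed token shares with the query
        counts = defaultdict(int)
        for g in grams(word):
            for i in self.index[g]:
                counts[i] += 1
        ranked = sorted(counts, key=counts.get, reverse=True)[:limit]
        return [self.tokens[i] for i in ranked]

The expensive similarity or distance computation is then run only on the few hundred candidates returned, not on the whole vocabulary.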

Also, some parts of your code could be simplified. For example, this:

def similar(self,w1,w2):
    w1 = w1[:len(w2)]
    w2 = w2[:len(w1)]
    return sum([1 if i==j else 0 for i,j in zip(w1,w2)])/float(len(w1))

can be reduced to:

def similar(self, w1, w2, lenw1):
    # divide by float(lenw1) so Python 2 doesn't do integer division
    return sum(i == j for i, j in zip(w1, w2)) / float(lenw1)

where lenw1 is the pre-computed length of "w1".

michaelmeyer
  • Then can I overcome the for loop, at least partially? @doukremt – tenstar Jun 30 '13 at 14:57
  • @tenstar: Sure. Basically, you keep, say, the 100~150 most similar tokens (in terms of gram distribution) and compute the distance between them and the token in the query. To give you an idea, for a spellchecker I wrote which has a vocabulary of 630000 tokens (all data stored on disk), it takes 1~1.5s to find the most relevant token. If you want more speed, keep the hash table in memory (then the search time falls to ~0.5s). – michaelmeyer Jun 30 '13 at 15:55