
So, I've written a spellchecker in Java and things work as they should. The only problem is that if the maximum allowed edit distance is too large (say, 9), then my code runs out of memory. I've profiled my code and dumped the heap to a file, but I don't know how to use it to optimize my code.

Can anyone offer any help? I'm more than willing to put up the heap dump file, or to use any other approach that people might suggest.

-Edit-

Many people asked for more details in the comments. I figured that other people would find them useful, and they might get buried in the comments. Here they are:

  • I'm using a Trie to store the words themselves.

  • In order to improve time efficiency, I don't compute the Levenshtein Distance upfront; I calculate it as I go. By this I mean that I keep only two rows of the LD table in memory. Since a Trie is a prefix tree, every time I recurse down a node, the previous letters of the word (and therefore the distances for those prefixes) remain the same. Therefore, I only calculate the distance with the new letter included, with the previous row remaining unchanged (see the sketch after this list).

  • The suggestions that I generate are stored in a HashMap. The rows of the LD table are stored in ArrayLists.
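
Here's a minimal sketch of that two-row computation on plain strings, just to illustrate the idea (bare int[] rows here instead of the ArrayLists my actual code uses):

    // Two-row Levenshtein distance: only the previous and current row of the
    // table are kept in memory, and the two arrays are swapped instead of
    // being reallocated on every iteration.
    public static int levenshtein(String a, String b) {
        int[] previousRow = new int[b.length() + 1];
        int[] currentRow = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previousRow[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            currentRow[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int delete = currentRow[j - 1] + 1;
                int insert = previousRow[j] + 1;
                int swap = previousRow[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                currentRow[j] = Math.min(delete, Math.min(insert, swap));
            }
            int[] tmp = previousRow;
            previousRow = currentRow;
            currentRow = tmp;
        }
        return previousRow[b.length()];
    }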

Here's the code of the function in the Trie that leads to the problem. Building the Trie is pretty straightforward, and I haven't included that code here.
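
For reference, the node layout that get() (below) operates on looks roughly like this (a sketch rather than my exact class):

    // Rough sketch of the trie node the get() method below operates on:
    // children[0..25] map to 'a'..'z', and isWord marks the end of a word.
    public class TrieNode {
        TrieNode[] children = new TrieNode[26];
        boolean isWord;

        void insert(String word) {
            TrieNode node = this;
            for (char c : word.toCharArray()) {
                int index = c - 'a';
                if (node.children[index] == null) {
                    node.children[index] = new TrieNode();
                }
                node = node.children[index];
            }
            node.isWord = true;
        }
    }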

    /*
     * @param letter       the letter currently being looked at in the trie
     * @param word         the word we are trying to find matches for
     * @param previousRow  the previous row of the Levenshtein Distance table
     * @param suggestions  all the suggestions for the given word
     * @param maxd         the max distance a word can be from the query and still be returned as a suggestion
     * @param suggestion   the current suggestion being constructed
     */
    public void get(char letter, ArrayList<Character> word, ArrayList<Integer> previousRow,
                    HashSet<String> suggestions, int maxd, String suggestion) {

        // the new row of the Levenshtein Distance table that is to be computed
        ArrayList<Integer> currentRow = new ArrayList<Integer>(word.size() + 1);
        currentRow.add(previousRow.get(0) + 1);

        int insert = 0;
        int delete = 0;
        int swap = 0;
        int d = 0;

        for (int i = 1; i < word.size() + 1; i++) {
            delete = currentRow.get(i - 1) + 1;
            insert = previousRow.get(i) + 1;

            if (word.get(i - 1) == letter)
                swap = previousRow.get(i - 1);
            else
                swap = previousRow.get(i - 1) + 1;

            d = Math.min(delete, Math.min(insert, swap));
            currentRow.add(d);
        }

        // if this node represents a word and the distance so far is <= maxd,
        // add this word as a suggestion
        if (isWord && d <= maxd) {
            suggestions.add(suggestion);
        }

        // if any entry in the current row is <= maxd, possible matches remain:
        // recursively search all the branches of the trie
        for (int i = 0; i < currentRow.size(); i++) {
            if (currentRow.get(i) <= maxd) {
                for (int j = 0; j < 26; j++) {
                    if (children[j] != null) {
                        children[j].get((char) (j + 97), word, currentRow, suggestions,
                                maxd, suggestion + String.valueOf((char) (j + 97)));
                    }
                }
                break;
            }
        }
    }
efficiencyIsBliss
  • The first thing to tell us is what algorithm and data structures you used to implement the spellchecker; code might be nice. – wkl Nov 03 '11 at 17:10
  • Since you are running out of memory, you need to find what data structures are using the most memory and optimise them so they use less. To be more specific we would need more details. – Peter Lawrey Nov 03 '11 at 17:12
  • @efficiencyIsBliss: a spellchecker should typically generate a list of "candidates" using several methods, including the "double metaphone" method or similar (finding valid words that sound like your word) and generating a lot of possible variants (insert any one letter, remove each single letter, invert letters, etc., and keep the valid words you get). Then you take all your candidates and compute the edit distance (using the brutally fast DP algo to compute the LED) and suggest first the words with the lowest edit distance. How come you generate up to an edit distance of 9? – TacticalCoder Nov 03 '11 at 17:20
  • @efficiencyIsBliss: My point is: you are the one generating the candidates and hence you should be able to dodge any combinatorial explosion. The dynamic programming algo to compute the Levenshtein Edit Distance is blazingly fast and can be implemented so that it doesn't create a single object. Here's an old but cool write-up on the subject you may like (including the dynamic programming LED): http://www.ibm.com/developerworks/java/library/j-jazzy/ – TacticalCoder Nov 03 '11 at 17:26
  • @efficiencyIsBliss: having seen the code you posted, I think you somehow inverted your logic there. I'll post some (working) code outlining how a simple spellchecker works. – TacticalCoder Nov 03 '11 at 17:38
  • @user988052 My code works as it should (or at least, I haven't found any problems yet). Would you mind explaining why you think my logic is inverted? – efficiencyIsBliss Nov 03 '11 at 17:48
  • @efficiencyIsBliss: see my answer. I don't know why you play tricks with "rows" etc. *(disclaimer: I've actually coded a spellchecker back in the days where memory did matter, taking great care of bitpacking the dictionary as to have a minimal footprint ; )* – TacticalCoder Nov 03 '11 at 17:53

3 Answers


Here's some code I quickly crafted showing one way to generate the candidates and then "rank" them.

The trick is: you never "test" a non-valid candidate.

To me, your "I run out of memory when I've got an edit distance of 9" screams "combinatorial explosion".

Of course, to dodge a combinatorial explosion you don't do things like trying to generate all words that are at a distance of 9 from your misspelled word. You start from the misspelled word and generate (quite a lot of) possible candidates, but you refrain from creating too many candidates, for then you'd run into trouble.

(also note that it doesn't make much sense to compute up to a Levenshtein Edit Distance of 9, because technically any word of less than 10 letters can be transformed into any other word of less than 10 letters in at most 9 transformations)

Here's why you simply cannot test all words up to a distance of 9 without either getting an OutOfMemoryError or a program that simply never terminates:

  • generating all the variations up to LED 1 for the word "ptmizing" by only adding one letter (from a to z) already gives 9*26 variations, i.e. 234 variations (there are 9 positions where you can insert one of 26 letters)
  • generating all the variations up to LED 2, by only adding one letter to what we now have, already gives 10*26*234 variations (60 840)
  • generating all the variations up to LED 3 gives 17 400 240 variations

And that is only by considering the case where we add one, two or three letters (we're not counting deletions, swaps, etc.). And that is on a misspelled word that is only eight characters long. On "real" words, it explodes even faster.

Sure, you could get "smart" and generate this in a way that avoids too many dupes etc., but the point stands: it's a combinatorial explosion, and it explodes fast.
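
To make that growth concrete, here's a quick sketch that reproduces the insertion-only counts above (assuming the eight-letter word "ptmizing"):

    // Insertion-only candidate counts for an 8-letter word: at depth d there
    // are (8 + d) insertion positions and 26 possible letters per insertion.
    public class CandidateGrowth {
        public static void main(String[] args) {
            long count = 1;
            int length = 8; // "ptmizing"
            for (int d = 1; d <= 3; d++) {
                count *= (length + d) * 26L;
                System.out.println("LED up to " + d + ": " + count + " variations");
            }
            // prints 234, 60840 and 17400240 (duplicates not removed)
        }
    }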

Anyway... Here's an example. I'm simply passing the dictionary of valid words (containing only four words in this case) to the corresponding method to keep this short.

You'll obviously want to replace the call to the LED with your own LED implementation.

The double-metaphone is just an example: in a real spellchecker, words that "sound alike" despite a large LED should be considered "more correct" and hence suggested first. For example, "optimizing" and "aupteemising" are quite far apart from a LED point of view, but using the double-metaphone you should get "optimizing" as one of the first suggestions.
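
For example, using a double-metaphone implementation such as the one in Apache Commons Codec (my pick here just for illustration; any implementation will do):

    import org.apache.commons.codec.language.DoubleMetaphone;

    public class PhoneticExample {
        public static void main(String[] args) {
            final DoubleMetaphone dm = new DoubleMetaphone();
            // the two encodings should come out (nearly) identical even though
            // the edit distance between the two words is large
            System.out.println(dm.doubleMetaphone("optimizing"));
            System.out.println(dm.doubleMetaphone("aupteemising"));
            System.out.println(dm.isDoubleMetaphoneEqual("optimizing", "aupteemising"));
        }
    }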

(disclaimer: the following was cranked out in a few minutes; it doesn't take into account uppercase, non-English words, etc.: it's not a real spell-checker, just an example)

    @Test
    public void spellCheck() {
        final String src = "misspeled";
        final Set<String> validWords = new HashSet<String>();
        validWords.add("boing");
        validWords.add("Yahoo!");
        validWords.add("misspelled");
        validWords.add("stackoverflow");
        final List<String> candidates = findNonSortedCandidates( src, validWords );
        final SortedMap<Integer,String> res = computeLevenhsteinEditDistanceForEveryCandidate(candidates, src);
        for ( final Map.Entry<Integer,String> entry : res.entrySet() ) {
            System.out.println( entry.getValue() + " @ LED: " + entry.getKey() );
        }
    }

    private SortedMap<Integer, String> computeLevenhsteinEditDistanceForEveryCandidate(
            final List<String> candidates,
            final String mispelledWord
    ) {
        final SortedMap<Integer, String> res = new TreeMap<Integer, String>();
        for ( final String candidate : candidates ) {
            res.put( dynamicProgrammingLED(candidate, mispelledWord), candidate );
        }
        return res;
    }

    private int dynamicProgrammingLED( final String candidate, final String misspelledWord ) {
        return Levenhstein.getLevenshteinDistance(candidate,misspelledWord);
    }

Here you generate all possible candidates using several methods. I've only implemented one such method (and quickly, so it may be bogus, but that's not the point ;)

    private List<String> findNonSortedCandidates( final String src, final Set<String> validWords ) {
        final List<String> res = new ArrayList<String>();
        res.addAll( allCombinationAddingOneLetter(src, validWords) );
//        res.addAll( allCombinationRemovingOneLetter(src) );
//        res.addAll( allCombinationInvertingLetters(src) );
        return res;
    }

    private List<String> allCombinationAddingOneLetter( final String src, final Set<String> validWords ) {
        final List<String> res = new ArrayList<String>();
        for (char c = 'a'; c <= 'z'; c++) { // <= so that 'z' is not skipped
            for (int i = 0; i < src.length(); i++) {
                final String candidate = src.substring(0, i) + c + src.substring(i, src.length());
                if ( validWords.contains(candidate) ) {
                    res.add(candidate); // only adding candidates we know are valid words
                }
            }
            if ( validWords.contains(src+c) ) {
                res.add( src + c );
            }
        }
        return res;
    }
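
For completeness, here's what one of the commented-out generators could look like (a quick sketch in the same spirit; note I pass the dictionary in, unlike the commented-out call above):

    private List<String> allCombinationRemovingOneLetter( final String src, final Set<String> validWords ) {
        final List<String> res = new ArrayList<String>();
        for (int i = 0; i < src.length(); i++) {
            // drop the letter at position i and keep the candidate only if
            // it is a valid dictionary word
            final String candidate = src.substring(0, i) + src.substring(i + 1);
            if ( validWords.contains(candidate) ) {
                res.add(candidate);
            }
        }
        return res;
    }
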
TacticalCoder
  • +1 for such a quick response. I've written such code myself and I agree that it's elegant and requires less memory than my approach. However, if you want to find all suggestions within distance 2, you would have to run `allCombinationAddingOneLetter` on all the elements that are at distance 1 from `src`. In this way, if I want to allow a distance of up to 9, I would have to repeatedly call `allCombinationAddingOneLetter` over and over and things could become pretty slow. What do you think? – efficiencyIsBliss Nov 03 '11 at 18:02
  • @efficiencyIsBliss: you can write cleaner methods: I just quickly cranked some code... It's trickier: I only produce valid words. To find elements at a distance of two, you have to write another method, which also finds only valid words. Otherwise you'd start with *"mispeled"* and you would *not* get back "misspelled" because you wouldn't have got "misspeled" (or "mispelled") from the first call. There are ways to write the code generating all the combination relatively cleanly but... You cannot go up to 9, the combinatorial explosion is too big. I'll edit more to give an example. – TacticalCoder Nov 03 '11 at 18:09
  • Could you please include some tips about how to reduce my memory footprint in the above code as well? Thanks for all the help and the discussion man! – efficiencyIsBliss Nov 03 '11 at 18:24
  • @efficiencyIsBliss: it all depends on how big your dictionary is. Heck, nowadays 300 000 *String* objects for your dictionary ain't exactly an issue. Moreover, you'll only be generating valid words, so the code above should not have any memory issue because you'll only be creating relatively small lists/sets/maps. The only real memory used should be the memory used by the dictionary itself. *(I've edited to show why there's a combinatorial explosion)*. – TacticalCoder Nov 03 '11 at 18:28
  • No, I meant in the code that I posted. I might just end up using what you posted, but I'd still like to know how I could have improved my own code. – efficiencyIsBliss Nov 03 '11 at 18:40

One thing you could try is increasing the Java heap size in order to overcome the "out of memory" error.

The following article will help you understand how to increase the heap size in Java:

http://viralpatel.net/blogs/2009/01/jvm-java-increase-heap-size-setting-heap-size-jvm-heap.html
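
For example, the standard -Xms/-Xmx flags set the initial and maximum heap size when launching the JVM (the class name here is just a placeholder):

    java -Xms256m -Xmx1024m SpellChecker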

But I think the better approach to your problem is to find a better algorithm than the current one.

Upul Bandara
  • I know how to increase the Java heap size, but I don't think I want to do that. I want to make it more efficient so that it can run in the default heap space. – efficiencyIsBliss Nov 03 '11 at 17:37

Well, without more information on the topic there is not much the community can do for you... You can start with the following:

  1. Look at what your profiler says (after it has run for a little while): does anything pile up? Are there a lot of objects? This should normally give you a hint about what is wrong with your code.

  2. Publish your saved dump somewhere and link it in your question, so someone else can take a look at it.

  3. Tell us which profiler you are using, so somebody can give you hints on where to look for valuable information.

  4. Once you have narrowed down your problem to a specific part of your code and you cannot figure out why there are so many objects of $FOO in memory, post a snippet of the relevant part.

Christian Uhl