9

I'm writing an Android word app. My code includes a method that would find all combinations of the string and the substrings of a 7 letter string with a minimum of length 3. Then compare all available combination to every word in the dictionary to find all the valid words. I'm using a recursive method. Here's the code.

// Gets all the permutations of a string.
void permuteString(String beginningString, String endingString) {
    if (endingString.length() <= 1){
        if((Arrays.binarySearch(mDictionary, beginningString.toLowerCase() +   endingString.toLowerCase())) >= 0){
            mWordSet.add(beginningString + endingString);
        }
    }
    else
        for (int i = 0; i < endingString.length(); i++) {
            String newString = endingString.substring(0, i) + endingString.substring(i + 1);
            permuteString(beginningString + endingString.charAt(i), newString);
      }
}
// Get the combinations of the sub-strings. Minimum 3 letter combinations
void subStrings(String s){
    String newString = "";
    if(s.length() > 3){
        for(int x = 0; x < s.length(); x++){
            newString = removeCharAt(x, s);
            permuteString("", newString);
            subStrings(newString);
        }
    }
}

The above code runs fine but when I installed it on my Nexus s I realized that it runs a bit too slow. It takes a few seconds to complete. About 3 or 4 seconds which is unacceptable. Now I've played some word games on my phone and they compute all the combinations of a string instantly which makes me believe that my algorithm is not very efficient and it can be improved. Can anyone help?


public class TrieNode {
TrieNode a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z;
TrieNode[] children = {a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z};
private ArrayList<String> words = new ArrayList<String>();

public void addWord(String word){
    words.add(word);
}
public ArrayList<String> getWords(){
    return words;
}
}

public class Trie {

static String myWord;
static String myLetters = "afinnrty";
static char[] myChars;
static Sort sort;
static TrieNode myNode = new TrieNode();
static TrieNode currentNode;
static int y = 0;
static ArrayList<String> availableWords = new ArrayList<String>();

public static void main(String[] args) {

    readWords();
    getPermutations();
}
public static void getPermutations(){
    currentNode = myNode;
    for(int x = 0; x < myLetters.length(); x++){
        if(currentNode.children[myLetters.charAt(x) - 'a'] != null){
            //availableWords.addAll(currentNode.getWords());
            currentNode = currentNode.children[myLetters.charAt(x) - 'a'];
            System.out.println(currentNode.getWords() + "" + myLetters.charAt(x));
        }
    }
    //System.out.println(availableWords);
}
public static void readWords(){
    try {
        BufferedReader in = new BufferedReader(new FileReader("c://scrabbledictionary.txt"));
        String str;
        while ((str = in.readLine()) != null) {
            myWord = str;
            myChars = str.toCharArray();
            sort = new Sort(myChars);
            insert(myNode, myChars, 0);
        }
        in.close();
    } catch (IOException e) {
    }
}
public static void insert(TrieNode node, char[] myChars, int x){    
    if(x >= myChars.length){
        node.addWord(myWord);
        //System.out.println(node.getWords()+""+y);
        y++;
        return;
    }
    if(node.children[myChars[x]-'a'] == null){
        insert(node.children[myChars[x]-'a'] = new TrieNode(), myChars, x=x+1);
    }else{
        insert(node.children[myChars[x]-'a'], myChars, x=x+1);
    }
}
}
Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555
zataar
  • 199
  • 1
  • 1
  • 7

6 Answers6

16

In your current approach, you're looking up every permutation of each substring. So for "abc", you need to look up "abc", "acb", "bac", "bca", "cab" and "cba". If you wanted to find all permutations of "permutations", your number of lookups is nearly 500,000,000, and that's before you've even looked at its substrings. But we can reduce this to one lookup, regardless of length, by preprocessing the dictionary.

The idea is to put each word in the dictionary into some data structure where each element contains a set of characters, and a list of all words containing (only) those characters. So for example, you could build a binary tree, which would have a node containing the (sorted) character set "abd" and the word list ["bad", "dab"]. Now, if we want to find all permutations of "dba", we sort it to give "abd" and look it up in the tree to retrieve the list.

As Baumann pointed out, tries are well suited to storing this kind of data. The beauty of the trie is that the lookup time depends only on the length of your search string - it is independent of the size of your dictionary. Since you'll be storing quite a lot of words, and most of your search strings will be tiny (the majority will be the 3-character substrings from the lowest level of your recursion), this structure is ideal.

In this case, the paths down your trie would reflect the character sets rather than the words themselves. So if your entire dictionary was ["bad", "dab", "cab", "cable"], your lookup structure would end up looking like this:

Example trie

There's a bit of a time/space trade-off in the way you implement this. In the simplest (and fastest) approach, each Node contains just the list of words, and an array Node[26] of children. This allows you to locate the child you're after in constant time, just by looking at children[s.charAt(i)-'a'] (where s is your search string and i is your current depth in the trie).

The downside is that most of your children arrays will be mostly empty. If space is an issue, you can use a more compact representation like a linked list, dynamic array, hash table, etc. However, these come at the cost of potentially requiring several memory accesses and comparisons at each node, instead of the simple array access above. But I'd be surprised if the wasted space was more than a few megabytes over your whole dictionary, so the array-based approach is likely your best bet.

With the trie in place, your whole permutation function is replaced with one lookup, bringing the complexity down from O(N! log D) (where D is the size of your dictionary, N the size of your string) to O(N log N) (since you need to sort the characters; the lookup itself is O(N)).

EDIT: I've thrown together an (untested) implementation of this structure: http://pastebin.com/Qfu93E80

Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555
Nick Barnes
  • 19,816
  • 3
  • 51
  • 63
  • 3
    I get it to 239500800 permutations (not 479001600). Did you count the two ts as different? If the word was "aaa", then I think there would be only 1 permutation, not 6. But other than that a good answer, +1 from me. – Roger Lindsjö Feb 04 '12 at 08:43
  • @Roger Yes, "aaa" has only one distinct permutation, but the asker's code doesn't look for duplicates, so it's still generating 6 copies and doing a lookup for each. – Nick Barnes Feb 04 '12 at 08:54
  • It took me a while to understand this data structure but I finally did (I think). So the node for this kind of tree would have 26 branches. One for each character of the alphabet. And each node would also have a list that would contain the corresponding words for that node. Is this correct? – zataar Feb 06 '12 at 23:13
  • Sounds about right. I've added a couple of paragraphs about the implementation details to my answer. – Nick Barnes Feb 07 '12 at 02:30
  • @NickBarnes So if you had the string "abcd" for the above example, how would you find the permutations "cab", "bad" and "dab" since the tree forks at 'b'? – zataar Feb 09 '12 at 04:49
  • @zataar Those aren't permutations of "abcd", they're permutations of *substrings* of "abcd". This lookup replaces your `permuteString()` function, but your `subStrings()` function would otherwise stay the same. You *could* build all this information into your trie (adding all words for all substrings into every node), but it'll take much longer to construct, much more memory to store, and won't improve your worst-case performance. But if the version I've given is still too slow, let me know and I'll try to elaborate on this alternative. – Nick Barnes Feb 09 '12 at 07:55
  • @NickBarnes I've added the code that I have so far to the main question. Please take a look at it. Thanks – zataar Feb 10 '12 at 01:57
  • @zataar All looks about right; I don't see any logical errors. But a couple of areas could use a bit of a cleanup, and it needs a well-defined interface, instead of a bunch of static variables. Check this out: http://pastebin.com/Qfu93E80 (untested, may not even compile, let alone work ;) ) – Nick Barnes Feb 10 '12 at 03:20
  • @NickBarnes Thanks for the polished code. I will test it later.The problem with this code is that if you give it a real word then it will find all sub words. But if you give it some letters at random then the system breaks down and I'm not sure what to do about it! – zataar Feb 10 '12 at 04:53
  • @zataar I don't quite follow. If you give it some random letters, it should find you all permutations of those random letters. Is that what you're expecting? Where and how does it "break down", exactly? – Nick Barnes Feb 10 '12 at 06:11
  • @NickBarnes I guess I'm still not quite clear about it. For example if you give it the word "infantry" it returns the following []a [fa]f []i [fain, naif]n []n []r []t [infantry]y and if you give it "abdefhilmp" it returns []a []b [ab, ba]d [bad, dab]e [abed, bade, bead]f []i []l Does this look OK to you? – zataar Feb 10 '12 at 06:39
  • @zataar Sorry for the late reply. When you say "it returns the following", I'm not sure exactly what "it" is, or how you're using it. If you could stick your code on pastebin.com, I'll take a look. – Nick Barnes Feb 16 '12 at 22:59
  • @NickBarnes Thanks for the reply. At that time I had missed the part where you said I must still use the subStrings() function. Once I did the Trie worked perfectly but I ran into some new problems. My Nexus S phone started to Force Close and my Android emulator gave an (out of memory error) someone said that this is common on phones due to the limited memory. One solution is to write the Trie data to the SD card. I'm not sure how to write such a data structure to an SD card. Would you have any pointers? – zataar Feb 18 '12 at 17:16
  • @zataar You could try the more space-efficient version I mentioned - store a character at each node (as in the diagram above), and replace the `TrieNode[26]` with an `ArrayList`. Doing it straight from the SD card would be tricky, and much slower. By the way, how much memory is available to your application? How big is your dictionary? How many TrieNodes are created? Does it run out of memory while building the trie, or somewhere else? – Nick Barnes Feb 19 '12 at 00:26
  • @NickBarnes Actually the ArrayList is what I've used. My dictionary's size is 53,818 words which is about 425KB. I've counted 62,766 TrieNodes. It runs out of memory while building the Trie and my phone comes with 512MB of RAM. By the way, I've tried it with about 40,000 words and it worked fine! – zataar Feb 19 '12 at 03:32
  • @zataar That doesn't sound right... I'd expect 62,766 nodes to be well under 10MB, even with all those nearly-empty 26-element arrays. I don't think the space-efficient implementation should need more than 2MB. – Nick Barnes Feb 19 '12 at 04:48
  • @NickBarnes My mistake! It's the Array implementation instead of the ArrayList. How much more memory do you think the Array implementation would require? I've also been reading on this site that Android limits the memory per app to 24MB. I'm not even sure how true this information is! – zataar Feb 19 '12 at 08:13
  • @zataar Each array contains 26 pointers, so 26*62766 in total. But we've only got 62765 *valid* pointers (one to each node, excluding the root). At 4 bytes each, that's ~200kB of valid pointers, and 6MB of nulls. So you can save ~6MB by only storing the valid children. The downside is that you lose the constant-time indexed lookup in the child array - you need to search through the child list at every node for the next character. – Nick Barnes Feb 19 '12 at 10:29
1
  static List<String> permutations(String a) {
    List<String> result=new LinkedList<String>();
    int len = a.length();
    if (len<=1){
      result.add(a);
    }else{
      for (int i=0;i<len; i++){
        for (String it:permutations(a.substring(0, i)+a.substring(i+1))){
          result.add(a.charAt(i)+it);
        }
      }
    }
    return result;
  }
olaf
  • 21
  • 2
1

I don't think adding all permutations is necessary. You can simply encapsulate the string into a PermutationString:

public class PermutationString {

    private final String innerString;

    public PermutationString(String innerString) {
        this.innerString = innerString;
    }

    @Override
    public int hashCode() {
        int hash = 0x00;
        String s1 = this.innerString;
        for(int i = 0; i < s1.length(); i++) {
            hash += s1.charAt(i);
        }
        return hash;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final PermutationString other = (PermutationString) obj;
        int nChars = 26;
        int[] chars = new int[nChars];
        String s1 = this.innerString;
        String s2 = other.innerString;
        if(s1.length() != s2.length()) {
            return false;
        }
        for(int i = 0; i < s1.length(); i++) {
            chars[s1.charAt(i)-'a']++;
        }
        for(int i = 0; i < s2.length(); i++) {
            chars[s2.charAt(i)-'a']--;
        }
        for(int i = 0; i < nChars; i++) {
            if(chars[i] != 0x00) {
                return false;
            }
        }
        return true;
    }

}

A PermutationString is a string, but where two PermutationStrings are equal if they have the same frequency of characters. Thus new PermutationString("bad").equals(new PermutationString("dab")). This also holds for the .hashCode(): if the strings are permutations of each other, they will generate the same .hashCode().

Now you can simply a HashMap<PermutationString,ArrayList<String>> as follows:

HashMap<PermutationString,ArrayList<String>> hm = new HashMap<PermutationString,ArrayList<String>>();
String[] dictionary = new String[] {"foo","bar","oof"};
ArrayList<String> items;
for(String s : dictionary) {
    PermutationString ps = new PermutationString(s);
    if(hm.containsKey(ps)) {
        items = hm.get(ps);
        items.add(s);
    } else {
        items = new ArrayList<String>();
        items.add(s);
        hm.put(ps,items);
    }
}

So now we iterate over all possible words in the dictionary, construct a PermutationString as key, and if the key already exists (that means that there is already a word with the same character frequencies), we simply add our own word to it. Otherwise, we add a new ArrayList<String> with the single word.

Now that we have filled up the hm with all permutations (but not that much keys), you can query:

hm.get(new PermutationString("ofo"));

This will return an ArrayList<String> with "foo" and "oof".

Testcase:

HashMap<PermutationString, ArrayList<String>> hm = new HashMap<PermutationString, ArrayList<String>>();
String[] dictionary = new String[]{"foo", "bar", "oof"};
ArrayList<String> items;
for (String s : dictionary) {
    PermutationString ps = new PermutationString(s);
    if (hm.containsKey(ps)) {
        items = hm.get(ps);
        items.add(s);
    } else {
        items = new ArrayList<String>();
        items.add(s);
        hm.put(ps, items);
    }
}
Assert.assertNull(hm.get(new PermutationString("baa")));
Assert.assertNull(hm.get(new PermutationString("brr")));
Assert.assertNotNull(hm.get(new PermutationString("bar")));
Assert.assertEquals(1,hm.get(new PermutationString("bar")).size());
Assert.assertNotNull(hm.get(new PermutationString("rab")));
Assert.assertEquals(1,hm.get(new PermutationString("rab")).size());
Assert.assertNotNull(hm.get(new PermutationString("foo")));
Assert.assertEquals(2,hm.get(new PermutationString("foo")).size());
Assert.assertNotNull(hm.get(new PermutationString("ofo")));
Assert.assertEquals(2,hm.get(new PermutationString("ofo")).size());
Assert.assertNotNull(hm.get(new PermutationString("oof")));
Assert.assertEquals(2,hm.get(new PermutationString("oof")).size());
Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555
1

See here: How to find list of possible words from a letter matrix [Boggle Solver]

The idea behind the code in the answers is as follows:

  • Iterate over each word dictionary.
  • Iterate over each letter in the word, adding it to a string and adding the string each time to an array of prefixes.
  • When creating string combinations, test to see that they exist in the prefix array before branching any further.
Community
  • 1
  • 1
mowwwalker
  • 16,634
  • 25
  • 104
  • 157
0

Use a Trie

Instead of testing all N! possibilities, you only follow prefix trees that lead to a result. This will significanlty reduce the amount of strings that you're checking against.

The Real Baumann
  • 1,941
  • 1
  • 14
  • 20
0

Well, you can extend your dictionary entities with array letters[] where letters[i] stays for times that i-th letter of alphabet used in this word. It'll take some additional memory, not far much than it is used now.

Then, for each word which permutations you want to check, you'll need to count number of distinct letters too and then traverse through dictiory with easy comparison procedure. If for all letters for word from dictionary number of occurrences less or equal than for word we are checking - yes, this word can be represented as permutation of substring, otherwise - no.

Complexity: it'll took O(D * maxLen) for precalculation, and O(max(N, D)) for each query.

OleGG
  • 8,589
  • 1
  • 28
  • 34
  • The query doesn't need to be O(D). You're searching for a specific `letters[]` array; if you sort your dictionary according to these arrays, you can find the one you're looking for in O(logD). That's pretty much what my solution above is doing. – Nick Barnes Feb 04 '12 at 13:57