How to find a word from arrays of characters?

Question

What is the best way to solve this:

I have a group of arrays with 3-4 characters inside each like so:

{p,     {a,    {t,    {m,
 q,      b,     u,     n,
 r,      c      v      o
 s      }      }      }
}

I also have an array of dictionary words.

What is the best/fastest way to find if the array of characters can combine to form one of the dictionary words? For example, the above arrays could make the words:

"pat","rat","at","to","bum"(lol)
but not "nub" or "mat"

Should i loop through the dictionary to see if words can be made or get all the combinations from the letters then compare those to the dictionary

What is the total size of the array, what is the size of your dictionary? — dfb, May 16 '11 at 20:26
no, im making an iphone app so i don't want to give the full details, — GeeGoldz, May 16 '11 at 20:26
the dictionary is 200,000 and the array contains 10 arrays of 3-4 letter arrays — GeeGoldz, May 16 '11 at 20:26
200k words, i got it from the mac /usr/share/dict/words file, a lot of them aren't really words but it was the easiest to get — GeeGoldz, May 16 '11 at 20:28
@GWW, if it was homework it would probably be a better formed question — John La Rooy, May 16 '11 at 20:34
how are the arrays of characters "combined". Can you give a few example result strings or pseudocode? — Jose_X, May 16 '11 at 20:41
Do you only need to get a YES/NO answer? I.e., is the dictionary in the word? Or do you also need to know which words you can potentially create? — csl, May 16 '11 at 21:01

score 19 · Accepted Answer · edited May 17 '21 at 07:18

I had some Scrabble code laying around, so I was able to throw this together. The dictionary I used is sowpods (267751 words). The code below reads the dictionary as a text file with one uppercase word on each line.

The code is C#:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Diagnostics;

namespace SO_6022848
{
  public struct Letter
  {
    public const string Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    public static implicit operator Letter(char c)
    {
      return new Letter() { Index = Chars.IndexOf(c) };
    }
    public int Index;
    public char ToChar()
    {
      return Chars[Index];
    }
    public override string ToString()
    {
      return Chars[Index].ToString();
    }
  }

  public class Trie
  {
    public class Node
    {
      public string Word;
      public bool IsTerminal { get { return Word != null; } }
      public Dictionary<Letter, Node> Edges = new Dictionary<Letter, Node>();
    }

    public Node Root = new Node();

    public Trie(string[] words)
    {
      for (int w = 0; w < words.Length; w++)
      {
        var word = words[w];
        var node = Root;
        for (int len = 1; len <= word.Length; len++)
        {
          var letter = word[len - 1];
          Node next;
          if (!node.Edges.TryGetValue(letter, out next))
          {
            next = new Node();
            if (len == word.Length)
            {
              next.Word = word;
            }
            node.Edges.Add(letter, next);
          }
          node = next;
        }
      }
    }

  }

  class Program
  {
    static void GenWords(Trie.Node n, HashSet<Letter>[] sets, int currentArrayIndex, List<string> wordsFound)
    {
      if (currentArrayIndex < sets.Length)
      {
        foreach (var edge in n.Edges)
        {
          if (sets[currentArrayIndex].Contains(edge.Key))
          {
            if (edge.Value.IsTerminal)
            {
              wordsFound.Add(edge.Value.Word);
            }
            GenWords(edge.Value, sets, currentArrayIndex + 1, wordsFound);
          }
        }
      }
    }

    static void Main(string[] args)
    {
      const int minArraySize = 3;
      const int maxArraySize = 4;
      const int setCount = 10;
      const bool generateRandomInput = true;

      var trie = new Trie(File.ReadAllLines("sowpods.txt"));
      var watch = new Stopwatch();
      var trials = 10000;
      var wordCountSum = 0;
      var rand = new Random(37);

      for (int t = 0; t < trials; t++)
      {
        HashSet<Letter>[] sets;
        if (generateRandomInput)
        {
          sets = new HashSet<Letter>[setCount];
          for (int i = 0; i < setCount; i++)
          {
            sets[i] = new HashSet<Letter>();
            var size = minArraySize + rand.Next(maxArraySize - minArraySize + 1);
            while (sets[i].Count < size)
            {
              sets[i].Add(Letter.Chars[rand.Next(Letter.Chars.Length)]);
            }
          }
        }
        else
        {
          sets = new HashSet<Letter>[] { 
          new HashSet<Letter>(new Letter[] { 'P', 'Q', 'R', 'S' }), 
          new HashSet<Letter>(new Letter[] { 'A', 'B', 'C' }), 
          new HashSet<Letter>(new Letter[] { 'T', 'U', 'V' }), 
          new HashSet<Letter>(new Letter[] { 'M', 'N', 'O' }) };
        }

        watch.Start();
        var wordsFound = new List<string>();
        for (int i = 0; i < sets.Length - 1; i++)
        {
          GenWords(trie.Root, sets, i, wordsFound);
        }
        watch.Stop();
        wordCountSum += wordsFound.Count;
        if (!generateRandomInput && t == 0)
        {
          foreach (var word in wordsFound)
          {
            Console.WriteLine(word);
          }
        }
      }
      Console.WriteLine("Elapsed per trial = {0}", new TimeSpan(watch.Elapsed.Ticks / trials));
      Console.WriteLine("Average word count per trial = {0:0.0}", (float)wordCountSum / trials);
    }

  }

}

Here is the output when using your test data:

PA
PAT
PAV
QAT
RAT
RATO
RAUN
SAT
SAU
SAV
SCUM
AT
AVO
BUM
BUN
CUM
TO
UM
UN
Elapsed per trial = 00:00:00.0000725
Average word count per trial = 19.0

And the output when using random data (does not print each word):

Elapsed per trial = 00:00:00.0002910
Average word count per trial = 62.2

EDIT: I made it much faster with two changes: Storing the word at each terminal node of the trie, so that it doesn't have to be rebuilt. And storing the input letters as an array of hash sets instead of an array of arrays, so that the Contains() call is fast.

One remark : in this implementation, a `Dictionary` is created for all possible nodes of the `trie`, which is very memory hungry. A possible optimization would be to create use a list or array to store children nodes (that you can keep sorted using a binary search) or even better, to link nodes together using some `previous` and `next` references (like in a `linked list`). I know that a list (and a linked list) have `O(N)` performance (vs `O(1)` for `Dictionary`) but since the number of element is pretty small here (maximum the number of letters in the alphabet) it will not hurt that much. — tigrou, Nov 05 '13 at 11:13
So, it will be about 26/2=13 times slower and you call that not hurting much? — Fantius, Nov 05 '13 at 16:53
`Dictionary` is not really `O(1)` (there is collisions). It also need to perform several things before retrieving an element (like calculating the `hashcode`). If I take your example that would mean checking if a value is in a `List` of 4 elements would be 2 times slower than using a `Dictionary` (while `List` will beat it hand down, iterating trough 4 elements is nothing). In the case of the `trie`, yes using `List` is slower, but definitely not 13 times slower. And if you keep the list sorted you can perform a binary search. It will take `O(log n)` which for 26 letters means 5 iterations. — tigrou, Nov 09 '13 at 00:24
V. useful Trie for another SO answer: https://stackoverflow.com/a/47277064/154355. However, the Trie has a bug when adding a word which is a substring of an existing word. For example, add Emergency then Emerge. The Trie will not actually add Emerge to the list, because you can only have a word at the terminal. My modification (see https://dotnetfiddle.net/jGbnCH) allows for words to be stored at intermediate nodes. — Mark Glasgow, Nov 14 '17 at 04:15

csl · Answer 2 · 2011-05-20T10:39:47.213

3

There are probably many way of solving this.

What you are interested in is the number of each character you have available to form a word, and how many of each character is required for each dictionary word. The trick is how to efficiently look up this information in the dictionary.

Perhaps you can use a prefix tree (a trie), some kind of smart hash table, or similar.

Anyway, you will probably have to try out all your possibilities and check them against the dictionary. I.e., if you have three arrays of three values each, there will be 3^3+3^2+3^1=39 combinations to check out. If this process is too slow, then perhaps you could stick a Bloom filter in front of the dictionary, to quickly check if a word is definitely not in the dictionary.

EDIT: Anyway, isn't this essentially the same as Scrabble? Perhaps try Googling for "scrabble algorithm" will give you some good clues.

edited May 20 '11 at 10:39

answered May 16 '11 at 20:33

csl

10,937
5
57
89

1

+1 - You'll still have to do some sort of DFS on the trie, it seems, though. In theory this could take O(# characters in the dictionary) but I bet in practice it would work fine. Anagram generators (e.g., Scrabble) I think generate all permutations of the letters which is probably infeasible for 30-40 characters/ – dfb May 16 '11 at 20:43
@spinning_plate: Yeah, I thought this was easy, but seems it actually can be a bit tricky. Maybe a DAG could work as well and be efficient? – csl May 16 '11 at 20:46
it's kind of similar to scrabble but the user will enter some number in which each digit corresponds to one of the letter groups – GeeGoldz May 16 '11 at 20:46
@csl - I think the trie is the right track (though a trie is a DAG, so maybe were just saying the same thign). You'd just have traverse it differently. Create a count of all the input letters, traverse to the next node only if we have the character, decrement and recurse. – dfb May 16 '11 at 20:48
that's a very different problem, I don't think I explained it well. There is 10 arrays of characters in a random order, each array contains 3-4 letters. So the order could be: {d,e,f} {a,b,c} {p,q,r,s} etc. I need an algorithm that finds words like "far" or "ear" in the order that the arrays are in. – GeeGoldz May 16 '11 at 21:00
@user: But you need to use all your arrays? I.e, the word "fa" wouldn't be an option? In your example, only three-letter words? – csl May 16 '11 at 21:04
not all arrays need to be used but they can't be in any order like "ber" – GeeGoldz May 16 '11 at 21:06
Ok, I get it. Perhaps you could update the question a little bit, in case somebody has a good solution to this? – csl May 16 '11 at 21:08

dfb · Answer 3 · 2011-05-16T22:54:32.950

The reformulated question can be answered just by generating and testing. Since you have 4 letters and 10 arrays, you've only got about 1 million possible combinations (10 million if you allow a blank character). You'll need an efficient way to look them up, use a BDB or some sort of disk based hash.

The trie solution previously posted should work as well, you are just restricted more by what characters you can choose at each step of the search. It should be faster as well.

score 1 · Answer 4 · answered May 24 '11 at 19:35

I just made a very large nested for loop like this:

for(NSString*s1 in [letterList objectAtIndex:0]{
    for(NSString*s2 in [letterList objectAtIndex:1]{
       8 more times...
    }
}

Then I do a binary search on the combination to see if it is in the dictionary and add it to an array if it is

How to find a word from arrays of characters?

4 Answers4

Linked