12

Eg if input string is helloworld I want the output to be like:

do
he
we
low
hell
hold
roll
well
word
hello
lower
world
...

all the way up to the longest word that is an anagram of a substring of helloworld. Like in Scrabble for example. The input string can be any length, but rarely more than 16 chars.

I've done a search and come up with structures like a trie, but I am still unsure of how to actually do this.

casperOne
  • 73,706
  • 19
  • 184
  • 253
PowerApp101
  • 1,798
  • 1
  • 18
  • 25
  • how come "do" is a valid string in the above case? It is not an anagram of any substring ryt? – nitish712 May 21 '13 at 02:10
  • @nitish712 "do" is valid because the letters 'd' and 'o' are in the input string. – PowerApp101 Jan 24 '14 at 03:56
  • This is similar to the problem of [finding substrings that are permutations of a string](https://stackoverflow.com/questions/22737407/permutation-of-string-as-substring-of-another). – Anderson Green Apr 15 '22 at 23:13

8 Answers8

14

The structure used to hold your dictionary of valid entries will have a huge impact on efficiency. Organize it as a tree, root being the singular zero letter "word", the empty string. Each child of root is a single first letter of a possible word, children of those being the second letter of a possible word, etc., with each node marked as to whether it actually forms a word or not.

Your tester function will be recursive. It starts with zero letters, finds from the tree of valid entries that "" isn't a word but it does have children, so you call your tester recursively with your start word (of no letters) appended with each available remaining letter from your input string (which is all of them at that point). Check each one-letter entry in tree, if valid make note; if children, re-call tester function appending each of remaining available letters, and so on.

So for example, if your input string is "helloworld", you're going to first call your recursive tester function with "", passing the remaining available letters "helloworld" as a 2nd parameter. Function sees that "" isn't a word, but child "h" does exist. So it calls itself with "h", and "elloworld". Function sees that "h" isn't a word, but child "e" exists. So it calls itself with "he" and "lloworld". Function sees that "e" is marked, so "he" is a word, take note. Further, child "l" exists, so next call is "hel" with "loworld". It will next find "hell", then "hello", then will have to back out and probably next find "hollow", before backing all the way out to the empty string again and then starting with "e" words next.

  • I like this explanation. I can (almost) get my head around it. Need to do a pen and paper analysis I think. – PowerApp101 May 19 '09 at 02:49
  • Looks like it's pretty close to Norman Ramsey's DAWG (other post); should have known there was a formal definition for it. –  May 19 '09 at 14:15
  • Nope...I've thought about this and decided I don't understand the tester function. I understand how to build the tree. How does it get from "hello" to "hollow"? Recursion was never my strong point! – PowerApp101 May 21 '09 at 15:33
  • It won't get from "hello" to "hollow" directly. The call that received params of "h" and "elloworld" will subsequently make calls to all child nodes, which will be all 2-letter combinations which start words ("he" & "lloworld"; "ho" & "ellworld"; but not "hl" & anything, because that child will not exist). The call through "he" etc. will get to "hello", & eventually that recursion will exhaust itself. Then from "h" & "elloworld" again you'll call "ho" & "ellworld", and that call will eventually recursively reach "hollow" –  May 21 '09 at 16:28
  • How can this be done using a SQL database? The size of dictionary I want to use can't be handled by the language I'm using (PHP). – Joe Phillips Aug 26 '09 at 03:53
9

I couldn't resist my own implementation. It creates a dictionary by sorting all the letters alphabetically, and mapping them to the words that can be created from them. This is an O(n) start-up operation that eliminates the need to find all permutations. You could implement the dictionary as a trie in another language to attain faster speedups.

The "getAnagrams" command is also an O(n) operation which searches each word in the dictionary to see if it is a subset of the search. Doing getAnagrams("radiotelegraphically")" (a 20 letter word) took approximately 1 second on my laptop, and returned 1496 anagrams.

# Using the 38617 word dictionary at 
# http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt
# Usage: getAnagrams("helloworld")

def containsLetters(subword, word):
    wordlen = len(word)
    subwordlen = len(subword)

    if subwordlen > wordlen:
        return False

    word = list(word)
    for c in subword:
        try:
            index = word.index(c)
        except ValueError:
            return False
        word.pop(index)
    return True

def getAnagrams(word):
    output = []
    for key in mydict.iterkeys():
        if containsLetters(key, word):
            output.extend(mydict[key])

    output.sort(key=len)
    return output

f = open("dict.txt")
wordlist = f.readlines()
f.close()

mydict = {}
for word in wordlist:
    word = word.rstrip()
    temp = list(word)
    temp.sort()
    letters = ''.join(temp)

    if letters in mydict:
        mydict[letters].append(word)
    else:
        mydict[letters] = [word]

An example run:

>>> getAnagrams("helloworld")
>>> ['do', 'he', 'we', 're', 'oh', 'or', 'row', 'hew', 'her', 'hoe', 'woo', 'red', 'dew', 'led', 'doe', 'ode', 'low', 'owl', 'rod', 'old', 'how', 'who', 'rho', 'ore', 'roe', 'owe', 'woe', 'hero', 'wood', 'door', 'odor', 'hold', 'well', 'owed', 'dell', 'dole', 'lewd', 'weld', 'doer', 'redo', 'rode', 'howl', 'hole', 'hell', 'drew', 'word', 'roll', 'wore', 'wool','herd', 'held', 'lore', 'role', 'lord', 'doll', 'hood', 'whore', 'rowed', 'wooed', 'whorl', 'world', 'older', 'dowel', 'horde', 'droll', 'drool', 'dwell', 'holed', 'lower', 'hello', 'wooer', 'rodeo', 'whole', 'hollow', 'howler', 'rolled', 'howled', 'holder', 'hollowed']
Unknown
  • 45,913
  • 27
  • 138
  • 182
  • This certainly gets the job done! But John Pirie's method is possibly more efficient? I'll play around with both. – PowerApp101 May 19 '09 at 08:14
  • @ 20th Century Boy: John Pirie describes some kind of trie, which I already suggested as a possible replacement to a hash table. However, it sounds incomplete. He describes walking the trie with H-e, which gives the word "he", but neglects the permutation "eh" (I think that's a word). – Unknown May 19 '09 at 08:22
  • 2
    @Unknown: I think "eh" would get detected after all the "h" words, ie when the function backtracks to the top of the tree and moves onto the branch starting with "e". – PowerApp101 May 19 '09 at 12:00
6

The data structure you want is called a Directed Acyclic Word Graph (dawg), and it is described by Andrew Appel and Guy Jacobsen in their paper "The World's Fastest Scrabble Program" which unfortunately they have chosen not to make available free online. An ACM membership or a university library will get it for you.

I have implemented this data structure in at least two languages---it is simple, easy to implement, and very, very fast.

Norman Ramsey
  • 198,648
  • 61
  • 360
  • 533
  • I found it on the first page of a Google search :-) – PowerApp101 May 19 '09 at 02:52
  • There is perhaps a faster one... see http://www.ericsink.com/downloads/faster-scrabble-gordon.pdf – Steve Powell May 26 '09 at 16:20
  • thanks @Norman and Steve to point out those amazing resources :) – Dzung Nguyen Feb 24 '15 at 15:02
  • 1
    The GADDAG is a faster algorithm for generating Scrabble moves, but not necessarily for just finding anagrams/subanagrams. They should be comparable, except that the DAWG might be faster since it's small enough to fit in cache memory. – Cesar Feb 05 '16 at 05:50
2

What you want is an implementation of a power set.

Also look at Eric Lipparts blog, he blogged about this very thing a little while back

EDIT:

Here is an implementation I wrote of getting the powerset from a given string...

private IEnumerable<string> GetPowerSet(string letters)
{
  char[] letterArray = letters.ToCharArray();
  for (int i = 0; i < Math.Pow(2.0, letterArray.Length); i++)
  {
    StringBuilder sb = new StringBuilder();
    for (int j = 0; j < letterArray.Length; j++)
    {
      int pos = Convert.ToInt32(Math.Pow(2.0, j));
      if ((pos & i) == pos)
      {
        sb.Append(letterArray[j]);
      }
    }
    yield return new string(sb.ToString().ToCharArray().OrderBy(c => c).ToArray());
  }
}

This function gives me the powersets of chars that make up the passed in string, I then can use these as keys into a dictionary of anagrams...

Dictionary<string,IEnumerable<string>>

I created my dictionary of anagrams like so... (there are probably more efficient ways, but this was simple and plenty quick enough with the scrabble tournament word list)

wordlist = (from s in fileText.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
                let k = new string(s.ToCharArray().OrderBy(c => c).ToArray())
                group s by k).ToDictionary(o => o.Key, sl => sl.Select(a => a));
Tim Jarvis
  • 18,465
  • 9
  • 55
  • 92
2

A simple-minded approach is to generate all the "substrings" and, for each of them, check whether it's an element of the set of acceptable words. E.g., in Python 2.6:

import itertools
import urllib

def words():
  f = urllib.urlopen(
    'http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt')
  allwords = set(w[:-1] for w in f)
  f.close()
  return allwords

def substrings(s):
  for i in range(2, len(s)+1):
    for p in itertools.permutations(s, i):
      yield ''.join(p)

def main():
  w = words()
  print '%d words' % len(w)
  ss = set(substrings('weep'))
  print '%d substrings' % len(ss)
  good = ss & w
  print '%d good ones' % len(good)
  sgood = sorted(good, key=lambda w:(len(w), w))
  for aword in sgood:
    print aword

main()

will emit:

38617 words
31 substrings
5 good ones
we
ewe
pew
wee
weep

Of course, as other responses pointed out, organizing your data purposefully can greatly speed-up your runtime -- although the best data organization for a fast anagram finder could well be different... but that will largely depend on the nature of your dictionary of allowed words (a few tens of thousands, like here -- or millions?). Hash-maps and "signatures" (based on sorting the letters in each word) should be considered, as well as tries &c.

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
  • 1
    This works for a small test string, but a 16-character one yields 20922789888000 possible substrings. For long test strings, you want to be bound by valid entries, rather than possible letter combinations. –  May 19 '09 at 02:06
0

I've been playing a lot of Wordfeud on my phone recently and was curious if I could come up with some code to give me a list of possible words. The following code takes your availble source letters (* for a wildcards) and an array with a master list of allowable words (TWL, SOWPODS, etc) and generates a list of matches. It does this by trying to build each word in the master list from your source letters.

I found this topic after writing my code, and it's definitely not as efficient as John Pirie's method or the DAWG algorithm, but it's still pretty quick.

public IList<string> Matches(string sourceLetters, string [] wordList)
{
    sourceLetters = sourceLetters.ToUpper();

    IList<string> matches = new List<string>();

    foreach (string word in wordList)
    {
        if (WordCanBeBuiltFromSourceLetters(word, sourceLetters))
            matches.Add(word);
    }

    return matches;
}


public bool WordCanBeBuiltFromSourceLetters(string targetWord, string sourceLetters)
{
    string builtWord = "";

    foreach (char letter in targetWord)
    {
        int pos = sourceLetters.IndexOf(letter);
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
            continue;
        }


        // check for wildcard
        pos = sourceLetters.IndexOf("*");
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
        }


    }

    return string.Equals(builtWord, targetWord);

}
Pete Nelson
  • 1,328
  • 1
  • 13
  • 23
0

Like Tim J, Eric Lippert's blog posts where the first thing to come to my mind. I wanted to add that he wrote a follow-up about ways to improve the performance of his first attempt.

Community
  • 1
  • 1
Lucas
  • 17,277
  • 5
  • 45
  • 40
0

I believe the Ruby code in the answers to this question will also solve your problem.

Community
  • 1
  • 1
las3rjock
  • 8,612
  • 1
  • 31
  • 33