1

I need to find all English words which can be formed from the letters in a string

 sentence="Ziegler's Giant Bar"

I can make an array of letters by

 sentence.split(//)  

How can I make more than 4500 English words from the sentence in Ruby?

[edit]

It may be best to split the problem into parts:

  1. to make only an array of words with 10 letters or less
  2. the longer words can be looked up separately
rampion
  • 87,131
  • 49
  • 199
  • 315
Léo Léopold Hertz 준영
  • 134,464
  • 179
  • 445
  • 697
  • 1
    question was ambiguous. Your result after puts() was not clear. Can you explain more? – ecleel May 09 '09 at 09:40
  • "can be formed from the letters in" is still ambiguous in that it's not clear if replacement is allowed. For instance: the word "aa" can be formed from text that included the letter "a", but only if you're allowed to use it more than once. Using /usr/share/dict/words and that string I find either 3182 or 6725 words, depending on which answer was the intent. – glenra May 18 '09 at 07:02
  • 1
    Aha! Knuth's dictionary can be downloaded here ("knuth_words.gz") http://www.packetstormsecurity.org/Crackers/wordlists/dictionaries/ . It's about half the size of the current standard UNIX dictionary. If you allow replacement, I get 4514 words using that dictionary - see my answer below. – glenra May 18 '09 at 08:03

4 Answers4

8

[Assuming you can reuse the source letters within one word]: For each word in your dictionary list, construct two arrays of letters - one for the candidate word and one for the input string. Subtract the input array-of-letters from the word array-of-letters and if there weren't any letters left over, you've got a match. Code to do that looks like this:

def findWordsWithReplacement(sentence)
    out=[]
    splitArray=sentence.downcase.split(//)
    `cat /usr/share/dict/words`.each{|word|
        if (word.strip!.downcase.split(//) - splitArray).empty?
            out.push word
        end
     }
     return out
end

You can call that function from the irb debugger like so:

output=findWordsWithReplacement("some input string"); puts output.join(" ")

...or here's a wrapper you could use to call the function interactively from a script:

puts "enter the text."
ARGF.each {|line|
    puts "working..."
    out=findWordsWithReplacement(line)
    puts out.join(" ")
    puts "there were #{out.size} words."
}

When running this on a Mac, the output looks like this:

$ ./findwords.rb
enter the text.
Ziegler's Giant Bar
working...
A a aa aal aalii Aani Ab aba abaiser abalienate Abantes Abaris abas abase abaser Abasgi abasia Abassin abatable abate abater abatis abaze abb Abba abbas abbasi abbassi abbatial abbess Abbie Abe abear Abel abele Abelia Abelian Abelite abelite abeltree Aberia aberrant aberrate abet abettal Abie Abies abietate abietene abietin Abietineae Abiezer Abigail abigail abigeat abilla abintestate
[....]
Z z za Zabaean zabeta Zabian zabra zabti zabtie zag zain Zan zanella zant zante Zanzalian zanze Zanzibari zar zaratite zareba zat zati zattare Zea zeal zealless zeallessness zebra zebrass Zebrina zebrine zee zein zeist zel Zelanian Zeltinger Zen Zenaga zenana zer zest zeta ziara ziarat zibeline zibet ziega zieger zig zigzag zigzagger Zilla zing zingel Zingiber zingiberene Zinnia zinsang Zinzar zira zirai Zirbanit Zirian Zirianian Zizania Zizia zizz
there were 6725 words.

That is well over 4500 words, but that's because the Mac word dictionary is pretty large. If you want to reproduce Knuth's results exactly, download and unzip Knuth's dictionary from here: http://www.packetstormsecurity.org/Crackers/wordlists/dictionaries/knuth_words.gz and replace "/usr/share/dict/words" with the path to wherever you've unpacked the substitute directory. If you did it right you'll get 4514 words, ending in this collection:

zanier zanies zaniness Zanzibar zazen zeal zebra zebras Zeiss zeitgeist Zen Zennist zest zestier zeta Ziegler zig zigging zigzag zigzagging zigzags zing zingier zings zinnia

I believe that answers the original question.

Alternatively, the questioner/reader might have wanted to list all the words one can construct from a string without reusing any of the input letters. My suggested code to accomplish that works as follows: Copy the candidate word, then for each letter in the input string, destructively remove the first instance of that letter from the copy (using "slice!"). If this process absorbs all the letters, accept that word.

def findWordsNoReplacement(sentence)
    out=[]
    splitInput=sentence.downcase.split(//)
    `cat /usr/share/dict/words`.each{|word|
        copy=word.strip!.downcase
        splitInput.each {|o| copy.slice!(o) }
        out.push word if copy==""
     }
     return out
end
glenra
  • 1,870
  • 2
  • 11
  • 11
  • Array subtraction is a bit more costly than using a regex, but it works. Plus, good job finding Knuth's dictionary! – rampion May 18 '09 at 14:37
  • Thanks! Yeah, your solution is probably faster, but mine's easier to read. – glenra May 18 '09 at 17:48
3

If you want to find words whose letters and frequency thereof are restricted by the given phrase, you can construct a regex to do this for you:

sentence = "Ziegler's Giant Bar"

# count how many times each letter occurs in the 
# sentence (ignoring case, and removing non-letters)
counts = Hash.new(0)
sentence.downcase.gsub(/[^a-z]/,'').split(//).each do |letter|
  counts[letter] += 1
end
letters = counts.keys.join
length = counts.values.inject { |a,b| a + b }

# construct a regex that matches upto that many occurences
# of only those letters, ignoring non-letters
# (in a positive look ahead)
length_regex = /(?=^(?:[^a-z]*[#{letters}]){1,#{length}}[^a-z]*$)/i
# construct regexes that matches each letter up to its
# proper frequency (in a positive look ahead)
count_regexes = counts.map do |letter, count|
  /(?=^(?:[^#{letter}]*#{letter}){0,#{count}}[^#{letter}]*$)/i
end

# combine the regexes, to form a regex that will only
# match words that are made of a subset of the letters in the string
regex = /#{length_regex}#{count_regexes.join('')}/

# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end

words.length #=> 3182
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "Abantes",
          "Abaris", "abas", "abase", "abaser", "Abasgi", "abate", "abater", "abatis",
          ...
          "ba", "baa", "Baal", "baal", "Baalist", "Baalite", "Baalize", "baar", "bae",
          "Baeria", "baetzner", "bag", "baga", "bagani", "bagatine", "bagel", "bagganet",
          ...
          "eager", "eagle", "eaglet", "eagre", "ean", "ear", "earing", "earl", "earlet",
          "earn", "earner", "earnest", "earring", "eartab", "ease", "easel", "easer",
          ...
          "gab", "Gabe", "gabi", "gable", "gablet", "Gabriel", "Gael", "gaen", "gaet",
          "gag", "gagate", "gage", "gageable", "gagee", "gageite", "gager", "Gaia",
          ...
          "Iberian", "Iberis", "iberite", "ibis", "Ibsenite", "ie", "Ierne", "Igara",
          "Igbira", "ignatia", "ignite", "igniter", "Ila", "ilesite", "ilia", "Ilian",
          ...
          "laang", "lab", "Laban", "labia", "labiate", "labis", "labra", "labret", "laet",
          "laeti", "lag", "lagan", "lagen", "lagena", "lager", "laggar", "laggen",
          ...
          "Nabal", "Nabalite", "nabla", "nable", "nabs", "nae", "naegate", "naegates",
          "nael", "nag", "Naga", "naga", "Nagari", "nagger", "naggle", "nagster", "Naias",
          ...
          "Rab", "rab", "rabat", "rabatine", "Rabi", "rabies", "rabinet", "rag", "raga",
          "rage", "rager", "raggee", "ragger", "raggil", "raggle", "raging", "raglan",
          ...
          "sa", "saa", "Saan", "sab", "Saba", "Sabal", "Saban", "sabe", "saber",
          "saberleg", "Sabia", "Sabian", "Sabina", "sabina", "Sabine", "sabine", "Sabir",
          ...
          "tabes", "Tabira", "tabla", "table", "tabler", "tables", "tabling", "Tabriz",
          "tae", "tael", "taen", "taenia", "taenial", "tag", "Tagabilis", "Tagal",
          ...
          "zest", "zeta", "ziara", "ziarat", "zibeline", "zibet", "ziega", "zieger",
          "zig", "zing", "zingel", "Zingiber", "zira", "zirai", "Zirbanit", "Zirian"]

Positive lookaheads let you make a regex that matches a position in the string where some specified pattern matches without consuming the part of the string that matches. We use them here to match the same string against multiple patterns in a single regex. The position only matches if all our patterns match.

If we allow infinite reuse of letters from the original phrase (like Knuth did according to glenra's comment), then it's even easier to construct a regex:

sentence = "Ziegler's Giant Bar"

# find all the letters in the sentence
letters = sentence.downcase.gsub(/[^a-z]/,'').split(//).uniq

# construct a regex that matches any line in which
# the only letters used are the ones in the sentence
regex = /^([^a-z]|[#{letters.join}])*$/i

# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end

words.length #=> 6725
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "abalienate",
           ...
           "azine", "B", "b", "ba", "baa", "Baal", "baal", "Baalist", "Baalite",
           "Baalize", "baar", "Bab", "baba", "babai", "Babbie", "Babbitt", "babbitt",
           ...
           "Britannian", "britten", "brittle", "brittleness", "brittling", "Briza",
           "brizz", "E", "e", "ea", "eager", "eagerness", "eagle", "eagless", "eaglet",
           "eagre", "ean", "ear", "earing", "earl", "earless", "earlet", "earliness",
           ...
           "eternalize", "eternalness", "eternize", "etesian", "etna", "Etnean", "Etta",
           "Ettarre", "ettle", "ezba", "Ezra", "G", "g", "Ga", "ga", "gab", "gabber",
           "gabble", "gabbler", "Gabe", "gabelle", "gabeller", "gabgab", "gabi", "gable",
           ...
           "grittiness", "grittle", "Grizel", "Grizzel", "grizzle", "grizzler", "grr",
           "I", "i", "iba", "Iban", "Ibanag", "Iberes", "Iberi", "Iberia", "Iberian",
           ...
           "itinerarian", "itinerate", "its", "Itza", "Izar", "izar", "izle", "iztle",
           "L", "l", "la", "laager", "laang", "lab", "Laban", "labara", "labba", "labber",
           ...
           "litter", "litterer", "little", "littleness", "littling", "littress", "litz",
           "Liz", "Lizzie", "Llanberisslate", "N", "n", "na", "naa", "Naassenes", "nab",
           "Nabal", "Nabalite", "Nabataean", "Nabatean", "nabber", "nabla", "nable",
           ...
           "niter", "nitraniline", "nitrate", "nitratine", "Nitrian", "nitrile",
           "nitrite", "nitter", "R", "r", "ra", "Rab", "rab", "rabanna", "rabat",
           "rabatine", "rabatte", "rabbanist", "rabbanite", "rabbet", "rabbeting",
           ...
           "riteless", "ritelessness", "ritling", "rittingerite", "rizzar", "rizzle", "S",
           "s", "sa", "saa", "Saan", "sab", "Saba", "Sabaean", "sabaigrass", "Sabaist",
           ...
           "strigine", "string", "stringene", "stringent", "stringentness", "stringer",
           "stringiness", "stringing", "stringless", "strit", "T", "t", "ta", "taa",
           "Taal", "taar", "Tab", "tab", "tabaret", "tabbarea", "tabber", "tabbinet",
           ...
           "tsessebe", "tsetse", "tsia", "tsine", "tst", "tzaritza", "Tzental", "Z", "z",
           "za", "Zabaean", "zabeta", "Zabian", "zabra", "zabti", "zabtie", "zag", "zain",
           ...
           "Zirian", "Zirianian", "Zizania", "Zizia", "zizz"]
Community
  • 1
  • 1
rampion
  • 87,131
  • 49
  • 199
  • 315
  • @@rampion: There must be something wrong in the algorithm, since Knuth got about 60 years ago about 4500 words as a result. Does the dictionary have all English words? – Léo Léopold Hertz 준영 May 09 '09 at 16:41
  • Well, what dictionary was Knuth using? If we're using different dictionaries, we'll have different results. The English language has changed quite a bit since Knuth, so if you want the same results as Knuth, I'll need the same dictionary. – rampion May 09 '09 at 19:33
  • 1
    I believe Knuth was using a *smaller* dictionary than the modern one - and I've found it!...but he was allowing letter reuse. – glenra May 18 '09 at 07:59
1

I don't think that Ruby has an English dictionary. But you could try to store all permutations of the original string in an array, and check those strings against Google? Say that a word is actually a word, if has more than 100.000 hits or something?

tmadsen
  • 941
  • 7
  • 14
  • There's a dictionary lookup service here: http://services.aonaware.com/DictService/ but that won't work with most proper nouns. – DanSingerman May 09 '09 at 11:19
1

You can get an array of letters like so:

sentence = "Ziegler's Giant Bar"
letters = sentence.split(//)
August Lilleaas
  • 54,010
  • 13
  • 102
  • 111