Script to find words inside a given word from wordlist

Question

I have a dictionary with 250K words (txt file). For each of those words I would like to come up with a script that will throw all possible anagrams (each anagram should also be in the dictionary).

Ideally the script would output in this format:

word1: anagram1,anagram2...

word2: anagram1,anagram2...

Any help would be greatly appreciated.

I tried some scripts from this page: http://rosettacode.org/wiki/Anagrams but they all find anagrams which are the same word length. What I'm looking for is any word formations within the given word. We can have anagrams with anywhere from 2 chars to 7 chars for a 7 char word — peace4theapes, Oct 10 '12 at 23:49
Any word or phrase that exactly reproduces the letters in another order is an anagram. What you want are not anagrams. — Steve, Oct 10 '12 at 23:53
Steve - That might be true. I'm not sure what to call them, that's the closest I got. — peace4theapes, Oct 10 '12 at 23:54
No worries, but perhaps a title like "find words inside words" — Steve, Oct 10 '12 at 23:55
Sorry guys, these solutions are flying over my head. Maybe someone could provide me a link to a script or something? — peace4theapes, Oct 11 '12 at 00:11
We aren't here to write your program, but we will show you the way. — the Tin Man, Oct 11 '12 at 00:15
@peace4theapes: Are you expecting words to be made up of combinations of letters from other words? If all of the following words are in your dictionary and given the word "orchestra", would you expect to find "carthorse", "cart", "art", horse" etc or simply, "cart", "art", horse" etc? — Steve, Oct 11 '12 at 01:39
steve - it would be all words. any word which is formed from the same characters or subset of characters — peace4theapes, Oct 11 '12 at 16:48
sputnick - the script seems to be running, but I don't see any output. I also tried writing the output to a file. It just keeps running and the output file is always empty. Am I doing something wrong? — peace4theapes, Oct 11 '12 at 20:49

score 1 · Answer 1 · edited May 23 '17 at 11:48

It must be anagram week.

I'm going to refer you to an answer I submitted to a prior question: https://stackoverflow.com/a/12811405/128421. It shows how to build a hash for quick searches of words that have common letters.

For your purpose, of finding substrings/inner-words, you will also want to find the possible inner words. Here's how to quickly locate unique combinations of letters of varying sizes, based on a starting word:

word = 'misses'
word_letters = word.downcase.split('').sort
3.upto(word.length) { |i| puts word_letters.combination(i).map(&:join).uniq }

eim
eis
ems
ess
ims
iss
mss
sss
eims
eiss
emss
esss
imss
isss
msss
eimss
eisss
emsss
imsss
eimsss

Once you have those combinations, split them (or don't do the join) and do look-ups in the hash my previous answer built.

score 1 · Answer 2 · edited May 23 '17 at 12:27

Inspired by this, I would suggest you create a Trie.

Then, the trie with N levels will have all possible anagrams (where N is the length of the original word). Now, to get different sized words, I suggest you simply traverse the trie, ie. for all 3 letter subwords, just make all strings that are 3 levels deep in the trie.

I'm not really sure of this, because I didn't test this, but it's an interesting challenge, and this suggestion would be how I would start tackling it.

Hope it helps a little =)

score 0 · Answer 3 · answered Oct 11 '12 at 01:06

0

h = Hash.new{[]}
array_of_words.each{|w| h[w.downcase.chars.sort].push(w)}
h.values

answered Oct 11 '12 at 01:06

sawa

165,429
45
277
381

Gilles Quénot · Accepted Answer · 2012-10-11T03:50:31.087

What I tried so far in Perl :

use strict;
use warnings;

use Algorithm::Combinatorics qw(permutations);

die "First argument should be a dict\n" unless $ARGV[0] or die $!;
open my $fh, "<", $ARGV[0] or die $!;

my @arr = <$fh>;
my $h = {};

map { chomp; $h->{lc($_)} = [] } @arr;

foreach my $word (@arr) {
    $word = lc($word);
    my $chars = [ ( $word =~ m/./g ) ];
    my $it = permutations($chars);

    while ( my $p = $it->next ) {
        my $str = join "", @$p;

        if ($str ne $word && exists $h->{$str}) { 
            push @{ $h->{$word} }, $str
                unless grep { /^$str$/ } @{ $h->{$word} };
        }
    }

    if (@{ $h->{$word} }) {
        print "$word\n";
        print "\t$_\n" for @{ $h->{$word} };
    }
}

END{ close $fh; }

There's maybe some possible improvement for speed, but it works.

I use French dict from words archlinux package.

EXAMPLE

$ perl annagrammes.pl /usr/share/dict/french
abaissent
        absentais
        abstenais
abaisser
        baissera
        baserais
        rabaisse
(...)

NOTE To installl the perl module :

cpan -i Algorithm::Combinatorics

Script to find words inside a given word from wordlist

4 Answers4