Data structure for soundex algorithm?

Question

Can anyone suggest me on what data structure to use for a soundex algorithm program? The language to be used is Java. If anybody has worked on this before in Java. The program should have these features: be able to read about 50,000 words should be able to read a word and return the related words having the same soundex

I don't want the program implementation just few advices on what data structure to use.

score 3 · Answer 1 · edited Nov 07 '08 at 06:17

3

TIP: If you use SQL as a databackend then you can let SQL handle it with the two sql-functions SOUNDEX and DIFFERENCE.

Maybe not what you wanted, but many people do not know that MSsql has those two functions.

edited Nov 07 '08 at 06:17

Jonathan Leffler

730,956
141
904
1,278

answered Nov 06 '08 at 23:34

Stefan

11,423
8
50
75

Edward Kmett · Answer 2 · 2008-11-06T23:45:23.357

2

Well soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.

After that the 4 character code can be treated as an integer key.

Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory so nothing fancy is required.

Then walk the dictionary and each bucket is a group of similar sounding words.

Actually, here is the whole program in perl:

#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;
open(DICT,"</usr/share/dict/linux.words");
my %dictionary = ();
while (<DICT>) {
        chomp();
        chomp();
        push @{$dictionary{soundex($_)}},$_;
}
close(DICT);
while (<>) {
        my @words = split / +/;
        foreach (@words) {
            print Dumper $dictionary{soundex($_)};
        }
}

edited Nov 06 '08 at 23:45

answered Nov 06 '08 at 23:35

Edward Kmett

29,632
7
85
107

Why the double chomp? (Is it to remove both CR and LF in DOS files?) – Jonathan Leffler Nov 07 '08 at 06:18
Yeah, without it in DOS it would break otherwise. – Edward Kmett Nov 07 '08 at 12:49
Er, after writing this I realized that the question was tagged Java. But I'm leaving this here because I think its informative – Edward Kmett Nov 07 '08 at 12:50

score 1 · Answer 3 · answered Nov 06 '08 at 23:34

I believe you just need to convert the original strings into soundex keys into a hashtable; the value for each entry in the table would be a collection of original strings mapping to that soundex.

The MultiMap collection interface (and its implementations) in Google Collections would be useful to you.

score 1 · Answer 4 · answered Nov 07 '08 at 00:13

class SpellChecker
{

  interface Hash {
    String hash(String);
  }

  private final Hash hash;

  private final Map<String, Set<String>> collisions;

  SpellChecker(Hash hash) {
    this.hash = hash;
    collisions = new TreeSet<String, Set<String>>();
  }

  boolean addWord(String word) {
    String key = hash.hash(word);
    Set<String> similar = collisions.get(key);
    if (similar == null)
      collisions.put(key, similar = new TreeSet<String>());
    return similar.add(word);
  }

  Set<String> similar(String word) {
    Set<String> similar = collisions.get(hash.hash(word));
    if (similar == null)
      return Collections.emptySet();
    else
      return Collections.unmodifiableSet(similar);
  }

}

The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters does it output, etc.)

score 0 · Answer 5 · answered Nov 06 '08 at 23:35

0

Since soundex is a hash, I'd use a hash table, with the soundex as the key.

answered Nov 06 '08 at 23:35

warren

32,620
21
85
124

Thanks warren for a precise answer but what abaout the value? a linked list? a binary search tree? an array? or something else? and what about hashmap instead of hash table? – javac Nov 06 '08 at 23:37
A hashtable *is* a data structure... or at least I was pretty sure it was :) .. with 50000 words, I think *which* structure you pick won't be too horribly important; if you were talking 5000000, then it certainly could be – warren Nov 06 '08 at 23:49

score 0 · Answer 6 · answered Jan 01 '09 at 15:48

you want a 4-byte integer.

The soundex algorithm always returns a 4-character code, if you use ANSI inputs, you'll get 4-bytes back (represented as 4 letters).

So store the codes returned in a hashtable, convert your word to the code and look it up in the hashtable. Its really that easy.

Data structure for soundex algorithm?

6 Answers6

Linked