
My title got edited, so I wanted to make sure everyone knows this is homework. The problem is just to optimize the program; the hashing is my idea.

--

I'm working on optimizing a C program that groups together words that are anagrams of each other, and then prints them out.

Currently the program is basically a linked list of linked lists. Each link in the outer list is a group of words that are anagrams of each other.

The profile for the program shows that by far the largest portion of execution time is spent in the function wordLookup. This is because it has to search every node, and with possibly 100k words read in from a file, that can take a very long time. For instance, here is the gprof output for reading in 40k words:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
100.31      1.48     1.48    40000    37.12    37.12  wordLookup
  0.00      1.48     0.00    78235     0.00     0.00  newnode
  0.00      1.48     0.00    40000     0.00     0.00  sort_string
  0.00      1.48     0.00    38235     0.00     0.00  wordInsert
  0.00      1.48     0.00     1996     0.00     0.00  swap_words
  0.00      1.48     0.00     1765     0.00     0.00  wordAppend

My idea for making this faster is to change the data structure to a hash table that chains all anagrams of each other in the same slot.

Based on things my professor has said and things that I've read here, I'm thinking of something like the following for my hash function. (Note: the primes are assigned so that the most frequently used letters get the smallest primes and the least used get the largest.)

sort(string)

array alpha_primes = 5,71,37,29,2,53,59,19,11,83,79,31,43,13,7,67,97,23,17,3,41,73,47,89,61,101

hash(string) {
  hash = 1
  for (char in string) {
    hash *= alpha_primes[char - 'a']
  }
  return hash % tablesize
}

Is there a hash table size for this problem that would appropriately distribute the values such that each group of anagrams has a distinct index in the table?

If that is not possible, then should I:

  • chain the lists of words together (list of lists)
  • use a probing (linear or quadratic) solution
  • For either of these scenarios, what are the upsides/downsides when compared?
user677786
    I'd personally use a trie, using words with the letters sorted alphabetically as lookup keys. No worrying about hash collisions, fast lookups and insertions. Slightly more complex to do in C than a hash table, but not impossibly so. – Cairnarvon Apr 13 '13 at 23:22
  • The key idea is to sort the letters of the word. Then a simple hash table is going to give you very good performance. The trie is just the icing on the cake - with a naive implementation, chances are it will give you worse performance. – Karoly Horvath Apr 13 '13 at 23:30
  • So if I had a set of words: cat, act, god, dog, bet, teb, the trie would look something like this? http://textsnip.com/d93a4d - couldn't paste it formatted here. – user677786 Apr 13 '13 at 23:47
  • You should delete 2 from your list as it is not co-prime with integer overflow range. All they do is shift the result left until (after 32 of them) you get stuck at zero. If all factors are odd then only 1 lsb is junk. – sh1 Aug 10 '13 at 01:38

1 Answer


There is no way to guarantee that hashes will be unique. The probability of a collision can be calculated via the birthday problem, and your best bet is to minimize it.

The probability that at least two groups hash to the same value can be approximated as 1 - e^(-k(k-1)/(2n)), where k is the total number of groups you have (roughly the same as your word count) and n is the size of your hash's output space (2^(number of bits in your hash)).

My dictionary has roughly 100000 words. Plugging k = 100,000 and n = 2^32 into that formula gives about a 69% chance of at least one collision somewhere, but the expected number of colliding pairs, k(k-1)/(2n), is only about 1.2, so a 32-bit hash still separates almost every group. However, a hash table with 2^32 slots would use at least 4GB of RAM. Using a smaller table means more collisions. Chaining or probing won't make a huge difference in time.

As recommended in the comments on your question, a trie will end up in a smaller data structure overall.

Tordek
    Thanks for the info! I decided to go with hashing, and it worked out really well! I have put it in my todo list to look into trie structures more, though! – user677786 Apr 14 '13 at 06:52