
Below is a piece of my program which loads a file (a dictionary) into memory using a hash table. The dictionary contains only one word per line, but the process is taking too much time. How do I optimize it?

bool load(const char* dictionary)
{
    // TODO
    int k;
    FILE* fp = fopen(dictionary,"r");
    if(fp == NULL)
        return false;

    for(int i=0; i<26; i++)
    {
        hashtable[i] = NULL;
    }

    while(true)
    {
        if(feof(fp))
            return true;

        node* n = malloc(sizeof(node));

        n->pointer = NULL;

        fscanf(fp,"%s",n->word);

        if(isalpha(n->word[0]))
        {
            k = hashfunction(n->word);
        }

        else return true;

        if(hashtable[k] == NULL)
        {
            hashtable[k] = n;
            total_words++;
        }

        else
        {
            node* traverse = hashtable[k];
            while(true)
            {
                if(traverse->pointer == NULL)
                {
                    traverse->pointer = n;
                    total_words++;
                    break;
                }
                traverse = traverse->pointer;
            }
        }

    }
    return false;
}
  • And how many words (lines) are in the file? What does `hashfunction` do? Have you tried increasing the number of buckets so you don't need so many list traversals? But most importantly, have you tried using a *profiler* to find out where the bottlenecks are? – Some programmer dude Aug 18 '16 at 06:01
  • Unrelated to your problem, but your reading-loop is not much different from `while (!feof(fp))`, [which is wrong](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong). – Some programmer dude Aug 18 '16 at 06:02
  • This is "working" code. Post this on [codereview.stackexchange.com](http://codereview.stackexchange.com) if you want ideas on how to make it "better". And when doing so, you should include your hash function and driver-code, along with notes and measurements for whatever means you're formally testing this with. – WhozCraig Aug 18 '16 at 06:03
  • @WhozCraig It is not complete code though, so it can't be answered either there or here. – Lundin Aug 18 '16 at 06:34
  • Unrelated to the question, please read [Why is “while ( !feof (file) )” always wrong?](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong). – Lundin Aug 18 '16 at 06:34
  • After the `load` function is finished, go through the hash table, and count how many items are in each list. The counts should all be small, i.e. single-digit numbers. Otherwise, you need to make the table bigger and/or improve the hash function. To put some numbers on it: if the input file has 100,000 words in it, then you could use a 32,768 entry hash table if the hash function is really good, or a 262,144 entry hash table if your hash function is not so good. (A counting sketch follows these comments.) – user3386109 Aug 18 '16 at 06:42
  • @Lundin it doesn't belong here at all, and if the OP includes the laundry list mentioned when posting to CR, it will be a far better fit than here. I concur, if they do not (i.e. if they just cut and paste this to there), it belongs on *neither* site. – WhozCraig Aug 18 '16 at 06:52
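
To act on the bucket-count suggestion above, here is a minimal diagnostic sketch. It assumes the `node` type (a `word` buffer plus a `pointer` next-field) and the 26-entry `hashtable[]` implied by the question:

#include <stdio.h>

/* Print the length of every chain so the hash quality can be judged.
   Assumes the node/hashtable declarations from the question. */
void print_bucket_counts(void)
{
    for (int i = 0; i < 26; i++)   /* 26 matches the table size in the question */
    {
        int count = 0;
        for (node* n = hashtable[i]; n != NULL; n = n->pointer)
            count++;
        printf("bucket %2d: %d words\n", i, count);
    }
}

If a few buckets hold most of the words, the hash function (or the 26-bucket table size) is the bottleneck, not the file reading.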

2 Answers


Get rid of potential functional problems, then worry about performance.

A) `for(int i=0; i<26; i++)` may be wrong; the `hashtable[]` definition was not posted. In any case, such a small fixed table is certainly unwise for performance.

B) An unbounded `"%s"` is as unsafe as `gets()` - both are bad. Instead of `fscanf(fp,"%s",n->word);`, use `fgets()` (see the sketch after this list).

C) Instead of `if(feof(fp))`, check the return value of `fscanf()`/`fgets()`.

D) `isalpha(n->word[0])` --> `isalpha((unsigned char) n->word[0])` to cope with negative `char` values.

E) Check for memory allocation failure: `malloc()` can return NULL.

F) Other issues may also exist, depending on the unposted code.
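
Putting B through E together, a sketch of a safer read loop could look like this. `WORD_MAX` is an assumption (the `node` definition was not posted), and `hashfunction()` is the question's unposted function:

#include <ctype.h>    /* these belong at the top of the file */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORD_MAX 46   /* assumption: longest word + '\n' + '\0'; must match node's word[] size */

/* inside load(), replacing the while(true) loop: */
char buf[WORD_MAX];
while (fgets(buf, sizeof buf, fp) != NULL)   /* C) test the read itself, not feof() */
{
    buf[strcspn(buf, "\n")] = '\0';          /* drop the trailing newline */

    if (!isalpha((unsigned char) buf[0]))    /* D) cast copes with negative char */
        break;                               /* the original stops on non-alpha input */

    node* n = malloc(sizeof *n);
    if (n == NULL)                           /* E) handle allocation failure */
    {
        fclose(fp);
        return false;
    }
    strcpy(n->word, buf);                    /* B) length already bounded by fgets() */
    n->pointer = NULL;

    int k = hashfunction(n->word);
    /* ... insert n into hashtable[k] as before ... */
    total_words++;
}
fclose(fp);   /* the original never closes the file */
return true;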

Then form a simple test case with minimal working code, and consider posting it on codereview.stackexchange.com to solicit performance improvements.

chux - Reinstate Monica
  • Good answer. Remember to [remove the newline from the buffer](https://stackoverflow.com/questions/2693776/removing-trailing-newline-character-from-fgets-input) when using `fgets`. – user3386109 Aug 18 '16 at 18:34

You are making the assumption that all the words in the file are distinct. That is a reasonable assumption for a dictionary, but it is bad defensive programming. You should always assume that input is out to get you, which means you cannot really assume anything about it.

In this case, though, you could argue that repeated words in the hashtable do not prevent it from working; they just slow it down slightly. Since the erroneous input won't cause bugs, undefined behaviour, or other catastrophes, it is marginally acceptable to document the requirement that reference words be unique.

Anyway, if you are not actually checking for duplicates, there is no need to walk the entire hash bucket for every insertion. If you insert new entries at the beginning of the bucket rather than at the end, you can avoid the scan which will probably yield a noticeable speedup if the buckets are large.
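
A minimal sketch of that head insertion, reusing the names from the question's code:

/* O(1) head insertion: no chain traversal needed */
int k = hashfunction(n->word);
n->pointer = hashtable[k];   /* new node points at the old head (possibly NULL) */
hashtable[k] = n;            /* the bucket now starts with the new node */
total_words++;

This also removes the `hashtable[k] == NULL` special case, since prepending to an empty bucket works the same way as prepending to a populated one.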

Of course, that optimization can only be used when loading the dictionary. It won't help you use the hashtable once initialization is complete, and it is rarely worthwhile hyper-optimizing startup code.

rici