5

i've done in the past a small exercise about hashtable but the user was giving array size and also the struct was like this (so the user was giving a number and a word each time as input)

struct data
{
   int key;
   char c[20];
};

So it was quite simple since i knew the array size and also the user was saying how many items he will be give as input. The way i did it was

  • Hashing the keys the user gave me
  • find the position array[hashed(key)] in the array
  • if it was empty i would put the data there
  • if it wasn't i would put it in the next free position i would find.

But now i have to make inverted index and i am reasearching so i can make a hashtable for it. So the words will be collected from around 30 txts and they will be so many. So in this case how long should the array be? How can i hash words? Should i use hasing with open adressing or with chaining. The exercise sais that we could use a hash table as it is if we find it online. but i prefer to understand and create it by my own. Any clues will help me :)

In this exerice(inverted index using hash table) the structs looks like this. data type is the type of the hash table i will create.

struct posting
{
    string word;
    posting *next
}

struct data
{
    string word;
    posting *ptrpostings;
    data *next
};
captain monk
  • 719
  • 4
  • 11
  • 34

2 Answers2

4

Hashing can be done anyway you choose. Suppose that the string is ABC. You can employ hashing as A=1, B=2, C=3, Hash = 1+2+3/(length = 3) = 2. But, this is very primitive.

The size of the array will depend on the hash algorithm that you deploy, but it is better to choose an algorithm that returns a definite length hash for every string. For example, if you chose to go with SHA1, you can safely allocate 40 bytes per hash. Refer Storing SHA1 hash values in MySQL Read up on the algorithm http://en.wikipedia.org/wiki/SHA-1. I believe that it can be safely used.

On the other hand, if it just for a simple exercise, you can also use MD5 hash. I wouldn't recommend using it in practical purposes as its rainbow tables are available easily :)

---------EDIT-------

You can try to implement like this ::

#include <iostream>
#include <string>
#include <stdlib.h>
#include <stdio.h>
#define MAX_LEN 30
using namespace std;

typedef struct
{
    string name; // for the filename
    ... change this to your specification
}hashd;

hashd hashArray[MAX_LEN]; // tentative

int returnHash(string s)
{
    // A simple hashing, no collision handled
    int sum=0,index=0;
    for(string::size_type i=0; i < s.length(); i++)
    {
        sum += s[i];
    }
    index = sum % MAX_LEN;
    return index;
}

int main()
{
    string fileName;
    int index;
    cout << "Enter filename        ::\t" ;
    cin >> fileName;
    cout << "Enter filename is     ::\t" + fileName << "\n";
    index = returnHash(fileName);
    cout << "Generated index is ::\t" << index << "\n";
    hashArray[index].name = fileName;
    cout << "Filename in array    ::\t" <<hashArray[index].name ;
    return 0;
}

Then, to achieve O(1), anytime you want to fetch the filename's contents, just run the returnHash(filename) function. It will directly return the index of the array :)

Community
  • 1
  • 1
  • First of all thank you for your answear :). Now i checked an MD5 generator and for the string caesar it returns b712916d8bfc1718a431c7b4fa280ae6 . How will i use b712916d8bfc1718a431c7b4fa280ae6 to go to the right slot of the array? Also i am still confused finding the size of the array and if i will have to resize it from time to time . In my current exercise where i have to deal with so many strings (30 txts with stories) how long should be the array? – captain monk Apr 15 '13 at 15:32
  • What you are describing is an operation to get the value in O(1) tie, am I right? To do that,you can use an algorithm that returns an index of the array from the string.Like the example above. Then, when you get "ABC", you can directly go to index[2].Please keep in mind that this has a high probability for collision, and you will have to employ open-hashing or closed-hashing techniques. Also, you are creating hashes for the names of the files of the stories right?Can you tell me why are you choosing c[20] in the array along with the key in that case? What is c[20]?I can answer better with that – Binayaka Chakraborty Apr 15 '13 at 15:43
  • The time i want is O(1) right. With the hashing algorithm you wrote, for example, how many slots should my array have? you are right i will also include the names of the stories. Well i used c[20] because the user was giving a key and a word and the word would be less than 20 letters. – captain monk Apr 15 '13 at 15:53
  • If you know the number of stories, then you can initialize the number of array of structures to that. As for the word, what is the role of the word? In case the word is not need, the structure wouldn't even be needed. Just calculate the index from the filename, and assign the array to that – Binayaka Chakraborty Apr 15 '13 at 16:00
  • If you know the number of stories, then you can initialize the number of array of structures to that? The main idea is to make an inverted index (a dicitionary) of all the words of all the txts i will read. So how can i know the size of the array by the number of stories? :S Sorry for my english – captain monk Apr 15 '13 at 16:39
  • From here [www.n3labs.com/pdf/make-inverted.pdf], take a look at page 31. I believe it will be enough :) As for the number N, take a quick count of the number of text documents either thorough or ls, and call calloc() for the linked list nodes :) – Binayaka Chakraborty Apr 15 '13 at 16:54
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/28265/discussion-between-binayaka-chakraborty-and-user2145433) – Binayaka Chakraborty Apr 15 '13 at 16:56
1

A hash table can be implemented as a simple 2-dimensional array. The question is how to compute the unique key for each item to be stored. Some things have keys built into the data, and for other things you'll have to compute one: MD5 as suggested above is probably just fine for your needs.

The next problem you need to solve is how to lay out, or size, your hash table. That's something that you'll ultimately need to tune to your own needs through some testing. You might start by setting up the 1st dimension of your array with 255 entries -- one for each combination of the first 2 digits of the MD5 hash. Whenever you have a collision, you add another entry along the 2nd dimension of your array at that 1st dimension index. This means that you'll statically define a 1-dimensional array while dynamically allocating the 2nd dimension entries as needed. Hopefully that makes as much sense to you as it does to me.

When doing lookups, you can immediately find the right 1st dimension index using the 1st 2-digits of the MD5 hash. Then a relativley short linear search along the 2nd dimension will quickly bring you to the item you seek.

You might find from experimentation that it's more efficient to use a larger 1st dimension (use the fisrt 3 digits of the MD5 hash) if your data set is sufficiently large. Depending on the size of texts involved and the distribution of their use of the lexicon, your results will probably dictate some of your architecture.

On the other hand, you might just start small and build in some intelligence to automatically resize and layout your table. If your table gets too long in either direction, performance will suffer.

alpartis
  • 1,086
  • 14
  • 29
  • Thank you for this it's for sure a bit revealing. You propose to use the hastable with chaining. What i don't understand is for example MD5HASH("caesar") returns b712916d8bfc1718a431c7b4fa280ae6 so the first 2 of what it returns is b7 right? thats menas i will go to position array[b7]? Sorry if what i ask is too dump but i am still confused about this b712916d8bfc1718a431c7b4fa280ae6 because in my other exercise it returned only numbers < size of array so i would just go to array[number]. – captain monk Apr 15 '13 at 17:53
  • 1
    Yes, I am proposing a hash table with chaining. Think of each entry in the hash array as a bucket that contains a dynamically growing collection of items that collide/tie with the given hash calculation. To answer your question: yes, you would put your entry for "caesar" into the "bucket" hasharray[b7] in your example. – alpartis Apr 24 '13 at 14:53