1

I want to create a hash table for an exercise I have to send in my University. The program will open a number of files, break each file's content to <<words>> (tokens) and it will save each <<word>> in a hash table with the frequency of each <<word>>.

In case the word is already in the hash table , the program will increase the word's frequency.

At the end the program will print the words and it's frequencies accordingly. Also the frequencies should be printed from the highest word frequency to the lowest. The comparison of the <<words>> will ignore upper and lower case letters.

For example if a file contains : one two three four Two Three Four THREE FOUR FoUr It should print:

four 4
three 3
two 2
one 1

The professor gave us a template that we should complete but I'm really confused on what to do with the insert_ht() and clear_ht() functions as well as the compare one.

Here is the code :

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

#define HTABLE_SIZ 1001
#define MAX_LINE_SIZ 1024

/* Hash Table */
typedef struct node* link;
struct node { char *token; int freq; link next; };

link htable[HTABLE_SIZ] = { NULL }; /* Table of lists (#buckets) */
int size = 0; /* Size (number of elements) of hash table */

unsigned int hash (char *tok );
void insert_ht (char *data);
void clear_ht ( );
void print_ht ( );

void Process(FILE *fp);


int main(int argc, char *argv[])
{
    int i;
    FILE *fp;
    for (i=1; i < argc; i++)
    {
        fp = fopen(argv[i],"r");
        if (NULL == fp)
        {
            fprintf(stderr,"Problem opening file: %s\n",argv[i]);
            continue;
        }
    Process(fp);
    fclose(fp);
    }
    print_ht();
    clear_ht();
    return 0;
}


void Process(FILE *fp)
{
    const char *seperators = " ?!'\";,.:+-*&%(){}[]<>\\\t\n";

    char line[MAX_LINE_SIZ];
    char *s;
    while((fgets(line,MAX_LINE_SIZ, fp)) != NULL)
    {
        for (s=strtok(line,seperators); s; s=strtok(NULL,seperators))
            insert_ht(s);
        }
    }

/* Hash Function */
unsigned int hash(char *tok)
{
    unsigned int hv = 0;
    while (*tok)
        hv = (hv << 4) | toupper(*tok++);
    return hv % HTABLE_SIZ;
}


void insert_ht(char *token)
{
……………………………………………
}
void clear_ht()
{
……………………………………………
}
int compare(const void *elem1, const void *elem2)
{
……………………………………………
}
void print_ht()
{
    int i, j=0;
    link l, *vector = (link*) malloc(sizeof(link)*size);
    for (i=0; i < HTABLE_SIZ; i++)
        for (l=htable[i]; l; l=l->next)
            vector[j++] = l;
        qsort(vector,size,sizeof(link),compare);
        for (i=0; i < size; i++)
            printf("%-50s\t%7d\n",vector[i]->token,vector[i]->freq);
        free(vector);
}
Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Scotoner
  • 27
  • 1
  • 4
  • 1
    Standard warning: Please [do not cast](http://stackoverflow.com/q/605845/2173917) the return value of `malloc()`. – Sourav Ghosh Dec 23 '14 at 13:38
  • Haven't got to the compile step yet. Trying to figure out the main functions in the program. Thanks for the tip though :) – Scotoner Dec 23 '14 at 13:52

3 Answers3

1

Double and Final edit : Ι found the solution. Apparently for some reason my compare function was wrong. I still haven't figured out why but here is the correct one, hopefully someone else will find this post helpful!

int compare(const void *elem1, const void *elem2)
{
    
     return (*(link*)elem2)->freq - (*(link*)elem1)->freq;
}

Edit: deleted old answer . Found the correct way I think but I have another problem right now. The compare function doesn't work correctly. My printf is fine but it doesnt sort them with the frequiencies. I want them to be sorted from the highest to lowest .

In this example: the file contains -> one two three four Two Three Four THREE FOUR FoUr And I get: two 2 one 1 four 4 three 3

While I should be getting : four 4 three 3 two 2 one 1

Here is the code. Feel free to help!

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

#define HTABLE_SIZ 1001
#define MAX_LINE_SIZ 1024

/* Hash Table */
typedef struct node* link;
struct node { char *token; int freq; link next; };

link htable[HTABLE_SIZ] = { NULL }; /* Table of lists (#buckets) */
int size = 0; /* Size (number of elements) of hash table */

unsigned int hash (char *tok );
void insert_ht (char *data);
void clear_ht ( );
void print_ht ( );

void Process(FILE *fp);


int main(int argc, char *argv[])
{
    int i;
    FILE *fp;
    printf("prin tin for \n");
    for (i=1; i < argc; i++)
    {
        printf("prin tin fopen \n");
        fp = fopen(argv[i],"r");
        if (NULL == fp)
        {
            fprintf(stderr,"Problem opening file: %s\n",argv[i]);
            continue;
        }
        printf("prin tin process \n");
    Process(fp);
    fclose(fp);
    }
    print_ht();
    //clear_ht();
    return 0;
}


void Process(FILE *fp)
{
    const char *seperators = " ?!'\";,.:+-*&%(){}[]<>\\\t\n";

    char line[MAX_LINE_SIZ];
    char *s;
    while((fgets(line,MAX_LINE_SIZ, fp)) != NULL)
    {
        for (s=strtok(line,seperators); s; s=strtok(NULL,seperators)){
            printf("prin tin insert %s \n",s);
            insert_ht(s);
        }
            
        }
    }
    
/* Hash Function */
unsigned int hash(char *tok)
{
    printf("bike stin hash \n");
    unsigned int hv = 0;
    while (*tok)
        hv = (hv << 4) | toupper(*tok++);
    printf("VGAINEIIIIIIIIIIIIII %d \n",hv);
    return hv % HTABLE_SIZ;
}



void insert_ht(char *token)
{
    printf("bike stin insert %s \n",token);
    unsigned int hashval = hash(token);

    if (htable[hashval]==NULL){
        printf("mesa stin prwti if %u %s \n",hashval,token);
        //token = strdup(token);
        htable[hashval] = malloc(sizeof(token));
        htable[hashval]->token = token ;
        htable[hashval]->freq = 1;
        size++;
        
    }else {
        htable[hashval]->freq++;
    }
    printf("ta evale epitixws \n");
    
}



int compare(const void *elem1, const void *elem2)
{
    const struct node *p1 = elem1;    
    const struct node *p2 = elem2;
    
    if ( p1->freq < p2->freq)
      return -1;

   else if (p1->freq > p2->freq)
      return 1;

   else
      return 0;
}
void print_ht()
{
    int i, j=0;
    link l, *vector = (link*) malloc(sizeof(link)*size);
    for (i=0; i < HTABLE_SIZ; i++)
        for (l=htable[i]; l; l=l->next)
            vector[j++] = l;
        qsort(vector,size,sizeof(link),compare);
        for (i=0; i < size; i++)
            printf("%-50s\t%7d\n",vector[i]->token,vector[i]->freq);
        free(vector);
} 
Scotoner
  • 27
  • 1
  • 4
1

I'll answer you in a new post because it's hard to be exhaustive in comments.

1. Malloc

Why would I need to use malloc then ? Shouldn't i write directly to the htable? (on the insert_ht() funtion)

You need to use malloc because you declare a char pointer in struct (char *token). The thing is that you never initialize the pointer to anything, and as far you don't know the size of the token, you need to malloc every token.

But, as you use strdup(token), you don't need to malloc token because strdup does. So don't forget to free every token in order to avoid memory leaks.

2. Segfault

I can't test you code, but it seems like the following line causes the segmentation fault :

list = htable[hashval]->token 

Indeed, you try to access token while htable[hashval] is NULL, and to assign a char * to a link type (list).

You need to loop with this :

for(list = htable[hashval]; list != NULL; list = list->next) { ... }

3. Notes

  • if (x=1) should be if(x==1).
  • Don't malloc new_list if you don't need to.
  • Because new_list if used when htable[hashval] is NULL, new_list->next = htable[hashval]; will set new_list->next to NULL.
  • You should use the -Wall option in gcc (for warnings) and you may use valgrind to understand your segmentation faults. In this case, use gcc with debug mode (-g).
Chostakovitch
  • 965
  • 1
  • 9
  • 33
  • Thanks for this. I think the main problem is with the fopen in the main program. I have no idea why i get the segmentation error . I run it like `./program_name ~/workspace/file` – Scotoner Dec 24 '14 at 16:18
  • Never mind. The error still exists inside `insert_ht()`. I changed what u said but still getting seg.fault. – Scotoner Dec 24 '14 at 16:56
  • @Scotoner Until today, i could not run your program. I will try it this afternoon, and we'll found what is the current problem. – Chostakovitch Dec 26 '14 at 08:26
0

Sorry for my bad english.

I think that :

  1. insert(char *token) takes a word of the file and puts into the hash table. In brief, if the word exists in the hash table, you just have to increment its frequencie. Otherwise, you need to create another node and put the frequencie to 1, then ad it to the array. At the end, you will have one entry for each unique word.

  2. compare(const void *elem1, const void *elem2) will be used by qsort. It returns 0 if elem1 = elem2, a negative number if elem1 < elem2 and a number > 0 if elem1 > elem2. By passing compare to qsort, you allow qsort to sort you array according to your own criteria.

  3. clear_ht() may set all the values of the array to NULL, in order to restart another count ?

Chostakovitch
  • 965
  • 1
  • 9
  • 33
  • I'm having trouble figuring how to do the insert function mostly. I did this for the clear_ht(): `void clear_ht(link htable) { free(htable->token); free(htable->freq); free(htable); }` – Scotoner Dec 23 '14 at 13:57
  • @Scotoner You don't need to take htable as a parameter, as it is a global variable. Additionnaly, you don't need to `free(htable)` because you didn't malloc'd it. Same thing for freq, it is an integer. But you will need to free every token because you will malloc them in the insert function, like this : `for(i = 0; i < HTABLE_SIZ; i++) { free(htable[i].token); }` – Chostakovitch Dec 23 '14 at 14:02
  • yeah but if i free every token shouldnt i free the frequencies as well? – Scotoner Dec 23 '14 at 14:09
  • Also I dont understand how will I use the `unsigned int hash(chat *tok)` function. – Scotoner Dec 23 '14 at 14:11
  • @Scotoner, you can't free something that hasn't been malloc'd or calloc'd. The reason is that malloc allocate the memory on the heap, and you need to manage yourself this memory. You may read this in order to help you to understand : http://gribblelab.org/CBootcamp/7_Memory_Stack_vs_Heap.html . So you will need to free tokens, because you need to malloc them, because you declare them with the `*` and not `[x]` I can only speculate concerning hash : usually, a hashCode should be a unique value associated with an element. So, if two elements have different hashcodes, they are different. – Chostakovitch Dec 23 '14 at 14:25
  • @Scotoner But if two hashcodes are equals, this doesn't mean that elements are equals, because of collisions. So I think the hash function allows you to know very fast if two elements are NOT equals, and then you can do the full check. – Chostakovitch Dec 23 '14 at 14:30
  • Alright I understand that now . I still dont know how I'm gonna make the insert. I also need to check if the word already exists in the table? Will I use loop for this? – Scotoner Dec 23 '14 at 14:39
  • @Scotoner : Here is my idea. I'm not a pro, so maybe this isn't the best way to do it. In insert, you loop through your array and you compare the token with the new token. I suggest that you use `compare()` for that. If you find a token with is equal, just increment freq. Else, add a new node at the end of the array, with freq = 1. I think that compare may cast the two void* in char* (we need void* for qsort) and then return `strcmp(elem1, elem2)` which does exactly what you want. – Chostakovitch Dec 23 '14 at 14:51
  • First of all , thanks for being so helpful. Secondly, I should compare the hash codes right? Considering : four = FoUr = fOUr. – Scotoner Dec 23 '14 at 15:21
  • @Sconoter Oh, well, I totally forgetted that the case was insensitive. To be honest, I don't understand the syntax used in the hash function, but it seems like it considers all characters as uppercase characters, so you can use it ! So maybe you could just compare the hash codes, but I wonder about collisions... I can't be helpful about that. – Chostakovitch Dec 23 '14 at 15:28
  • Why would I need to use malloc then. Shouldn't i write directly to the htable? (on the `insert_ht()` funtion. – Scotoner Dec 23 '14 at 18:03