C Hash table sizing issue

Question

Let me start off by saying I know very basic, basic C. I have implemented a Hash Table using the code from the tutorialspoint website.

My program is reading in positive whole numbers from a text file, putting the number and the number of times it occurs in the file as the key and value respectively. It works nicely actually, but I face a big problem: I need to know the size of the hash table beforehand. I noticed that when I skip over a negative number, it weirdly messes up the structure of the table.

Here is my code (if using a size of 20):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>

#define SIZE 20

struct DataItem {
  int key;
  int data;
};

struct DataItem* hashArray[SIZE];
struct DataItem* item;

int hashCode(int key) {
  return key % SIZE;
}

struct DataItem *search(int key) {
  int hashIndex = hashCode(key);

  while(hashArray[hashIndex] != NULL) {

    if(hashArray[hashIndex]->key == key)
    return hashArray[hashIndex];

    ++hashIndex;
    hashIndex %= SIZE;
  }

  return NULL;
}

void insert(int key,int data) {
  struct DataItem *item = (struct DataItem*) malloc(sizeof(struct DataItem));
  item->data = data;
  item->key = key;
  int hashIndex = hashCode(key);

  while(hashArray[hashIndex] != NULL && hashArray[hashIndex]->key != -1) {
    ++hashIndex;
    hashIndex %= SIZE;
  }

  hashArray[hashIndex] = item;
}

void display() {
  int key_index;
  for(key_index = 0; key_index<SIZE; key_index++) {
    if(hashArray[key_index] != NULL) {
      printf("%d has occurred %d time(s)\n", hashArray[key_index]->key, hashArray[key_index]->data);
    }
  }
  printf("\n");
}

void sort(size_of_struct) {
  int i, tempkey, tempdata;

  if(size_of_struct == 1) {
    return;
  }

  for(i=0; i<(size_of_struct-1); i++) {
    if(hashArray[i] != NULL) {
      if(hashArray[i]->key > hashArray[i+1]->key) {
        tempkey = hashArray[i]->key;
        hashArray[i]->key = hashArray[i+1]->key;
        hashArray[i+1]->key = tempkey;

        tempdata = hashArray[i]->data;
        hashArray[i]->data = hashArray[i+1]->data;
        hashArray[i+1]->data = tempdata;
      }
    }
  }
  sort(size_of_struct-1);
}

int main() {
  FILE *myFile;
  myFile = fopen("test.txt", "r");

  int *numberArray = NULL;
  int i, count, number, current_num, temp;
  count = number = temp = 0;

  numberArray = malloc(sizeof(int)*500);

  while(fscanf(myFile, "%d", &number) > 0) {
    if(number >= 0 && number <= 100) {
        numberArray[count] = number;
        count++;
    }
  }

  for(i=0; i < count; i++)
  {
    item = search(numberArray[i]);
    if(item == NULL) {
      insert(numberArray[i], 1);
    } else {
      item = search(numberArray[i]);
      item->data += 1;
    }
    free(numberArray);
    numberArray = malloc(sizeof(int)*500);
  }

  display();
  sort(SIZE);
}

If I run this code, I get the following result:

80 has occurred 1 time(s)
100 has occurred 1 time(s)
2 has occurred 1 time(s)
3 has occurred 2 time(s)
84 has occurred 1 time(s)
12 has occurred 1 time(s)
33 has occurred 1 time(s)
99 has occurred 3 time(s)

Segmentation fault: 11

If I use a SIZE of 8 (which is the total number of numbers that I am using from the file), then I get the following result:

80 has occurred 1 time(s)
33 has occurred 1 time(s)
2 has occurred 1 time(s)
99 has occurred 3 time(s)
3 has occurred 2 time(s)
12 has occurred 1 time(s)
100 has occurred 1 time(s)
84 has occurred 1 time(s)

2 has occurred 1 time(s)
3 has occurred 2 time(s)
12 has occurred 1 time(s)
33 has occurred 1 time(s)
80 has occurred 1 time(s)
84 has occurred 1 time(s)
99 has occurred 3 time(s)
100 has occurred 1 time(s)

Upon further investigation, I noticed this happening in my display function:

key_index: 0
80 has occurred 1 time(s)
key_index: 1
100 has occurred 1 time(s)
key_index: 2
2 has occurred 1 time(s)
key_index: 3
3 has occurred 2 time(s)
key_index: 4
84 has occurred 1 time(s)
key_index: 5
key_index: 6
key_index: 7
key_index: 8
key_index: 9
key_index: 10
key_index: 11
key_index: 12
12 has occurred 1 time(s)
key_index: 13
33 has occurred 1 time(s)
key_index: 14
key_index: 15
key_index: 16
key_index: 17
key_index: 18
key_index: 19
99 has occurred 3 time(s)

Segmentation fault: 11

I am not entirely sure what is happening. I know why the segmentation fault is occurring: it is trying to reference a value that is null at that index. What I don't know is why (inside of the display function) does it skip over the key_index values from 5-11 and 14-18; that is my real concern. I am assuming that because I am not using all of the numbers in the file, the structure is trying to populate it so that there are always 20 spaces. Am I wrong? Is there a way I can fix this?

My test.txt file:

99 2 99
3
80 12 -12 33
3 99 100 1234 84

Use a debugger. Get a backtrace from the point of the segfault. See what variables' values are on the way to that. And so on. — underscore_d, Nov 30 '17 at 17:29
I know why the segment fault is occurring: the index is out of range. The reason for this is explained above. When looping through the display function, the index value skips through a bunch of numbers. — , Nov 30 '17 at 18:05

score 1 · Answer 1 · answered Nov 30 '17 at 18:10

Hey looks like you've got a segmentation fault there. Darn, those can be pretty nasty to track down. As @underscore_d mentions, you can run your program through a debugger to examine the code while it's in flight.

However, you have fortunately provided a (mostly) Minimal, Complete and Verifiable example for us to take a look at. I provided my own test.txt file but was able to reproduce a crash. Now, this may not be the crash that you are seeing since I don't have your test.txt, but hopefully the following analysis will help you out anyway.

Since your example is short, you can also consider running your code through some analysis tools. I regularly run all my code through AddressSanitizer and Valgrind to help avoid introducing these kinds of errors. In your case, AddressSanitizer pinpoints the location of the error.

$ gcc -fsanitize=address -g test.c -o test
$ ./test
80 has occurred 1 time(s)

ASAN:DEADLYSIGNAL
=================================================================
==18300==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x561f6050b39f bp 0x7ffe4d6866b0 sp 0x7ffe4d686690 T0)
    #0 0x561f6050b39e in sort /home/v1bri/test.c:67
    #1 0x561f6050bae2 in main /home/v1bri/test.c:111
    #2 0x7f60938ee3f0 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x203f0)
    #3 0x561f6050acb9 in _start (/home/v1bri/test+0xcb9)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/v1bri/test.c:67 in sort
==18300==ABORTING

The error occurs on line 67 in the sort() function. Sure enough...

void sort(size_of_struct) {
// ...

  for(i=0; i<(size_of_struct-1); i++) {
    if(hashArray[i] != NULL) {
      if(hashArray[i]->key > hashArray[i+1]->key) {

You forgot to check for NULL when dereferencing the pointer at hashArray[i+1]. Valgrind will give you quite a few more errors but will also locate the culprit.

$ gcc -g test.c -o test
$ valgrind --tool=memcheck ./test
==18764== Memcheck, a memory error detector
==18764== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==18764== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info

<snip>

==18764== Invalid read of size 4
==18764==    at 0x108B0D: sort (test.c:67)
==18764==    by 0x108DA7: main (test.c:111)
==18764==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==18764==
==18764==
==18764== Process terminating with default action of signal 11 (SIGSEGV)
==18764==  Access not within mapped region at address 0x0
==18764==    at 0x108B0D: sort (test.c:67)
==18764==    by 0x108DA7: main (test.c:111)

I know you mentioned that you know "very basic, basic C", but hopefully you can see how easy it is to use modern debugging tools to test your program for errors. Learning tools can be daunting for a new programmer, but this is actually the best time to learn because it will save you so much time down the road.

I updated my question to include the .txt file. I know why the segmentation fault is occurring. It is trying to reference an index that is out of bounds. If you look at the display result I print out the key_index value and it is showing a skip between numbers 5-11 and then again from 14-18. So, rather than putting the numbers in key_indices 0-8, it is putting them in 0-4, 12, 13 and 19 and skips 5-11 and skips 14-18. — , Nov 30 '17 at 18:15
Found another problem that might be related. The final loop in `main()` seems to be releasing and re-allocating `numberArray` each time it loops around. You should only release `numberArray` when you're finished using it, after the loop finishes. There is no guarantee that a `free()` followed by a `malloc()` of the same size will give you back the exact same buffer that was just released. — v1bri, Nov 30 '17 at 18:40
I fixed it, but I am still having the same error with why its skipping those specific numbers. If i change SIZE to 8, it works just fine because I'm only using 8 numbers from the file — , Nov 30 '17 at 18:43
I'm not sure if it has to do with the way the structure is being created. — , Nov 30 '17 at 18:49
That is how the `%` operator in your hash function works. None of the numbers in your file hashed to those slots (nor ones that would spill over), so they were not filled. If you had included `36` for example, it would hash to `16` and populate that slot. As you mention, inserting 8 different numbers to a hash table of size 8 guarantees that all slots are filled. — v1bri, Nov 30 '17 at 18:51
Oh okay. Is there a way to change it so that I can fill all of those spaces without having it skip? Or so that the number is put into the first slot, the second number in the second slot, and so on? — , Nov 30 '17 at 18:56
Well you're only inserting 9 distinct numbers (2, 3, 12, 33, 80, 84, 99, 100, 1234) so you will only be able to fill 9 slots. Typically a real hash table will start with a small number of slots (ex: 8) and dynamically resize itself as entries are added. For performance reasons you never want to have a completely full hash table. See the concept of [load](https://stackoverflow.com/questions/10901752/what-is-the-significance-of-load-factor-in-hashmap) [factor](https://en.wikipedia.org/wiki/Hash_table#Key_statistics) for information on how a hash table decides when to grow or shrink. — v1bri, Nov 30 '17 at 19:05
I'm not exactly sure how to do that unless I re-write the entire structure of the table. Like I said, I know basic c, and I'm borrowing the code from the aforementioned website — , Nov 30 '17 at 19:09
You may want to consider a vector instead of a hash table if you're looking for contiguous storage of elements. — v1bri, Nov 30 '17 at 19:21

C Hash table sizing issue

1 Answers1