1

I already wrote a working project but my problem is, it is way slower than what I aimed in the first place so I have some ideas about how to improve it but I don't know how to implement these ideas or should I actually implement these ideas in the first place?

The topic of my project is, reading a CSV (Excel) file full of tweets and counting every single word of it, then displaying most used words.
(Every row of the Excel there is information about the tweet and the tweet itself, I should only care about the tweet)

Instead of sharing the whole code I will just simply wrote what I did so far and only ask about the part I am struggling.
First of all, I want to apologize because it will be a long question.

Important note: Only thing I should focus is speed, storage or size is not a problem.

All the steps:

  • Read a new line from Excel file.
  • Find the "tweet" part from the whole line and store it as a string.
  • Split the tweet string into words and store it in the array.
  • For every word stored in an array, calculate the ASCII value of the word.

(For finding ascii value of the word I simply sum the ascii value of each letter it has)

  • Put the word in Hash Table with the key of ASCII value.

(Example: Word "hello" has ascii value of 104+101+108+108+111 = 532, so it stored with key 532 in the hast table)

  • In Hash Table only the word "as a string" and the key value "as an int" is stored and count of the words (how much the same word is used) is stored in a separated array.

I will share the Insert function (for inserting something to the Hashtable) because I believe it will be confusing if I will try to explain this part without a code.

void Insert(int key, string value) //Key (where we want to add), Value (what we want to add)
{

    if (key < 0) key = 0; //If key is somehow less than 0, for not taking any error key become 0.

    if (table[key] != NULL) //If there is already something in hast table 
    {       

        if (table[key]->value == value) //If existing value is same as the value we want to add
        {               
            countArray[key][0]++;
        }
        else //If value is different,
        {           
            Insert(key + 100, value);  //Call this function again with the key 100 more than before.
        }
    }
    else //There is nothing saved in this place so save this value
    {           
        table[key] = new HashEntry(key, value); 
        countArray[key][1] = key;
        countArray[key][0]++;           
    }

}

So "Insert" function has three-part.

  • Add the value to hash table if hast table with the given key is empty.
  • If hast table with the given key is not empty that means we already put a word with this ascii value.

Because different words can have exact same ascii value.

  • The program first checks if this is the same word.
  • If it is, count increase (In the count array).
  • If not, Insert function is again called with the key value of (same key value + 100) until empty space or same value is found.

After whole lines are read and every word is stored in Hashtable ->

Sort the Count array
Print the first 10 element


This is the end of the program, so what is the problem?

Now my biggest problem is I am reading a very huge CSV file with thousands of rows, so every unnecessary thing increases the time noticeably.

My second problem is there is a lot of values with the same ASCII value, my method of checking hundred more than normal ascii value methods work, but ? for finding the empty space or the same word, Insert function call itself hundred times per word.
(Which caused the most problem).

So I thought about using multiple Hashtables.
For example, I can check the first letter of the word and if it is
Between A and E, store in the first hash table
Between F and J, store in the second hash table
...
Between V and Z, store in the last hash table.

Important note again: Only thing I should focus is speed, storage or size is not a problem.

So conflicts should minimize mostly in this way.
I can even create an absurd amount of hash tables and for every different letter, I can use a different hash table.
But I am not sure if it is the logical thing to do or maybe there are much simpler methods I can use for this.

If it is okay to use multiple hash tables, instead of creating hash tables, one by one, is it possible to create an array which stores a Hashtable in every location? (Same as Array of Arrays but this time array store Hashtables) If it is possible and logical, can someone show how to do it?

This is the hash table I have:

class HashEntry
{
public:
    int key;
    string value;
    HashEntry(int key, string value)
    {
        this->key = key;
        this->value = value;
    }
};

class HashMap
{
private:
    HashEntry * *table;
public:
    HashMap()
    {
        table = new HashEntry *[TABLE_SIZE];
        for (int i = 0; i < TABLE_SIZE; i++)
        {
            table[i] = NULL;
        }
    }

//Functions

}

I am very sorry for such a long question I asked and I am again very sorry if I couldn't explain every part clear enough, English is not my mother language.

Also one last note, I am doing this for a school project so I shouldn't use any third party software or include any different libraries because it is not allowed.

Utkan
  • 129
  • 12
  • Are you prohibited from using standard library as well? – r3mus n0x Dec 21 '18 at 20:40
  • @r3musn0x In project it is written "(You should not use third party libraries including C++ STL, Boost etc. ). However, you can use, iostream, ctime, fstream, string-like IO and string related classes. " So probably yes I need to use my own hash table. – Utkan Dec 21 '18 at 20:40
  • Why do you store a word count in a separate array instead of the hash table? – r3mus n0x Dec 21 '18 at 20:57
  • @r3musn0x Because it was easier for me to sort the array than trying the sort hash table with multiple values. I don't think it create that much time loss but if i am wrong i will try to change. – Utkan Dec 21 '18 at 20:58

1 Answers1

1

You are using a very bad hash function (adding all characters), that's why you get so many collisions and your Insert method calls itself so many times as a result.

For a detailed overview of different hash functions see the answer to this question. I suggest you try DJB2 or FNV-1a (which is used in some implementations of std::unordered_map).

You should also use more localized "probes" for the empty place to improve cache-locality and use a loop instead of recursion in your Insert method.

But first I suggest you tweak your HashEntry a little:

class HashEntry
{
public:
    string key; // the word is actually a key, no need to store hash value
    size_t value; // the word count is the value.
    HashEntry(string key)
        : key(std::move(key)), value(1) // move the string to avoid unnecessary copying
    { }
};

Then let's try to use a better hash function:

// DJB2 hash-function
size_t Hash(const string &key)
{
    size_t hash = 5381;
    for (auto &&c : key)
        hash = ((hash << 5) + hash) + c;
    return hash;
}

Then rewrite the Insert function:

void Insert(string key)
{
    size_t index = Hash(key) % TABLE_SIZE;

    while (table[index] != nullptr) {       
        if (table[index]->key == key) {               
            ++table[index]->value;
            return;
        }
        ++index;
        if (index == TABLE_SIZE) // "wrap around" if we've reached the end of the hash table
            index = 0;
    }           

    table[index] = new HashEntry(std::move(key));
}

To find the hash table entry by key you can use a similar approach:

HashEntry *Find(const string &key)
{
    size_t index = Hash(key) % TABLE_SIZE;

    while (table[index] != nullptr) {       
        if (table[index]->key == key) {               
            return table[index];
        }
        ++index;
        if (index == TABLE_SIZE)
            index = 0;
    }           

    return nullptr;
}
r3mus n0x
  • 5,954
  • 1
  • 13
  • 34
  • Thank you for the advice and showing what I am doing wrong, I tried to implement what you said but I am getting errors in two part. First "return value type does not match the function type" error in the "void Hash" and "expression must have integral or unscoped enum type" error in the Insert Function line "size_t index = Hash(key) % TABLE_SIZE;" (Hash(key) part) I wrote the "void Hash" inside of "class HashMap", maybe that's what I did wrong. – Utkan Dec 21 '18 at 22:31
  • @whiragon, sorry, my mistake. Of course `Hash` should return `size_t`, I've updated the code. – r3mus n0x Dec 21 '18 at 22:33
  • @whiragon, I don't think you need multiple hash tables unless you have several different sets of keys. Given that you only have words that are "mapped" to their respective counts - one hash table should be enough. If you, for example, needed the reverse mapping (count -> words) then you'll have to have a separate hash table for that. – r3mus n0x Dec 21 '18 at 22:38
  • Does using multiple Hashtables make any sense? Or is it completely unnecessary?. Also how I will read the values from the table? i was using the line "string value = table[int value]->value;" but i couldn't figure out how i should do it in new format – Utkan Dec 21 '18 at 22:39
  • @whiragon, I've updated the answer with the example. – r3mus n0x Dec 21 '18 at 22:48
  • Thank you for keep helping me i get it now and sorry for keep asking questions but I got another error in Insert function line "while (table[index] != nullptr) " it give error of "Exception thrown: read access violation. this->table was 0x1BEF00D". Any idea why? – Utkan Dec 21 '18 at 22:58
  • I edited the question and shared the current code so far, I removed the string editing parts from the code for better readability because I believe they are not necessary. – Utkan Dec 21 '18 at 23:22
  • @whiragon, I think you've accidentally removed the `HashMap` constructor that you had before. – r3mus n0x Dec 21 '18 at 23:26
  • Yes that's was the case, thank you so much for the all the help. Program work perfectly and 20 times faster. Now the only thing left is sorting the Hashtable but I think I can do that much by myself. – Utkan Dec 22 '18 at 00:57
  • @whiragon I assume the top list is a lot smaller than the hashtable. Instead of sorting hashtable just maintain the top list while you build the hashtable. Make the list in such a way that the count of the last item in the list is directly accessible, so that updating the list is fast. – Dialecticus Dec 22 '18 at 01:33
  • @Dialecticus Sorry I couldn't understand completely, i don't have a any list for now so are you suggesting that I should create a list and keep the top values there? But for this don't i need to know top values in the first place? – Utkan Dec 22 '18 at 02:13
  • @Liakan words would enter and leave the list as you read the tweets. Each time some word has bigger count then the word with smallest count in the list you add the former word to the list and remove the latter. But I just realized, you can construct the list in the end too. Just don't sort the whole table, because sorting items that will not be in the list in the end is a waste of time. – Dialecticus Dec 22 '18 at 09:22