I already wrote a working project but my problem is, it is way slower than what I aimed in the first place so I have some ideas about how to improve it but I don't know how to implement these ideas or should I actually implement these ideas in the first place?
The topic of my project is, reading a CSV (Excel) file full of tweets and counting every single word of it, then displaying most used words.
(Every row of the Excel there is information about the tweet and the tweet itself, I should only care about the tweet)
Instead of sharing the whole code I will just simply wrote what I did so far and only ask about the part I am struggling.
First of all, I want to apologize because it will be a long question.
Important note: Only thing I should focus is speed, storage or size is not a problem.
All the steps:
- Read a new line from Excel file.
- Find the "tweet" part from the whole line and store it as a string.
- Split the tweet string into words and store it in the array.
- For every word stored in an array, calculate the ASCII value of the word.
(For finding ascii value of the word I simply sum the ascii value of each letter it has)
- Put the word in Hash Table with the key of ASCII value.
(Example: Word "hello" has ascii value of 104+101+108+108+111 = 532, so it stored with key 532 in the hast table)
- In Hash Table only the word "as a string" and the key value "as an int" is stored and count of the words (how much the same word is used) is stored in a separated array.
I will share the Insert function (for inserting something to the Hashtable) because I believe it will be confusing if I will try to explain this part without a code.
void Insert(int key, string value) //Key (where we want to add), Value (what we want to add)
{
if (key < 0) key = 0; //If key is somehow less than 0, for not taking any error key become 0.
if (table[key] != NULL) //If there is already something in hast table
{
if (table[key]->value == value) //If existing value is same as the value we want to add
{
countArray[key][0]++;
}
else //If value is different,
{
Insert(key + 100, value); //Call this function again with the key 100 more than before.
}
}
else //There is nothing saved in this place so save this value
{
table[key] = new HashEntry(key, value);
countArray[key][1] = key;
countArray[key][0]++;
}
}
So "Insert" function has three-part.
- Add the value to hash table if hast table with the given key is empty.
- If hast table with the given key is not empty that means we already put a word with this ascii value.
Because different words can have exact same ascii value.
- The program first checks if this is the same word.
- If it is, count increase (In the count array).
- If not, Insert function is again called with the key value of (same key value + 100) until empty space or same value is found.
After whole lines are read and every word is stored in Hashtable ->
Sort the Count array
Print the first 10 element
This is the end of the program, so what is the problem?
Now my biggest problem is I am reading a very huge CSV file with thousands of rows, so every unnecessary thing increases the time noticeably.
My second problem is there is a lot of values with the same ASCII value, my method of checking hundred more than normal ascii value methods work, but ? for finding the empty space or the same word, Insert function call itself hundred times per word.
(Which caused the most problem).
So I thought about using multiple Hashtables.
For example, I can check the first letter of the word and if it is
Between A and E, store in the first hash table
Between F and J, store in the second hash table
...
Between V and Z, store in the last hash table.
Important note again: Only thing I should focus is speed, storage or size is not a problem.
So conflicts should minimize mostly in this way.
I can even create an absurd amount of hash tables and for every different letter, I can use a different hash table.
But I am not sure if it is the logical thing to do or maybe there are much simpler methods I can use for this.
If it is okay to use multiple hash tables, instead of creating hash tables, one by one, is it possible to create an array which stores a Hashtable in every location? (Same as Array of Arrays but this time array store Hashtables) If it is possible and logical, can someone show how to do it?
This is the hash table I have:
class HashEntry
{
public:
int key;
string value;
HashEntry(int key, string value)
{
this->key = key;
this->value = value;
}
};
class HashMap
{
private:
HashEntry * *table;
public:
HashMap()
{
table = new HashEntry *[TABLE_SIZE];
for (int i = 0; i < TABLE_SIZE; i++)
{
table[i] = NULL;
}
}
//Functions
}
I am very sorry for such a long question I asked and I am again very sorry if I couldn't explain every part clear enough, English is not my mother language.
Also one last note, I am doing this for a school project so I shouldn't use any third party software or include any different libraries because it is not allowed.