3

I have been working on a program, simulating a small database where I could make queries, and after writing the code, I have executed it, but the performance is quite bad. It works really slow. I have tried to improve it, but I started with C++ on my own a few months ago, so my knowledge is still very low. So I would like to find a solution to improve the performance.

Let me explain how my code works. Here I have atached a summarized example of how my code works.

First of all I have a .txt file simulating a database table with random strings separated with "|". Here you have an example of table (with 5 rows and 5 columns).

Table.txt

0|42sKuG^uM|24465\lHXP|2996fQo\kN|293cvByiV
1|14772cjZ`SN|28704HxDYjzC|6869xXj\nIe|27530EymcTU
2|9041ByZM]I|24371fZKbNk|24085cLKeIW|16945TuuU\Nc
3|16542M[Uz\|13978qMdbyF|6271ait^h|13291_rBZS
4|4032aFqa|13967r^\\`T|27754k]dOTdh|24947]v_uzg

This information in a .txt file is read by my program and stored in the computer memory. Then, when making queries, I will access to this information stored in the computer memory. Loading the data in the computer memory can be a slow process, but accessing to the data later will be faster, what really matters me.

Here you have the part of the code that read this information from a file and store in the computer.

Code that reads data from the Table.txt file and store it in the computer memory

string ruta_base("C:\\a\\Table.txt"); // Folder where my "Table.txt" is found

string temp; // Variable where every row from the Table.txt file will be firstly stored
vector<string> buffer; // Variable where every different row will be stored after separating the different elements by tokens.
vector<ElementSet> RowsCols; // Variable with a class that I have created, that simulated a vector and every vector element is a row of my table

ifstream ifs(ruta_base.c_str());

while(getline( ifs, temp )) // We will read and store line per line until the end of the ".txt" file. 
{
    size_t tokenPosition = temp.find("|"); // When we find the simbol "|" we will identify different element. So we separate the string temp into tokens that will be stored in vector<string> buffer

    while (tokenPosition != string::npos)
    {    
        string element;
        tokenPosition = temp.find("|");      

        element = temp.substr(0, tokenPosition);
        buffer.push_back(element);
        temp.erase(0, tokenPosition+1);
    }

    ElementSet ss(0,buffer); 
    buffer.clear();
    RowsCols.push_back(ss); // We store all the elements of every row (stores as vector<string> buffer) in a different position in "RowsCols" 
}

vector<Table> TablesDescriptor;

Table TablesStorage(RowsCols);
TablesDescriptor.push_back(TablesStorage);

DataBase database(1, TablesDescriptor);

After this, comes the IMPORTANT PART. Let's suppose that I want to make a query, and I ask for input. Let's say that my query is row "n", and also the consecutive tuples "numTuples", and the columns "y". (We must say that the number of columns is defined by a decimal number "y", that will be transformed into binary and will show us the columns to be queried, for example, if I ask for columns 54 (00110110 in binary) I will ask for columns 2, 3, 5 and 6). Then I access to the computer memory to the required information and store it in a vector shownVector. Here I show you the part of this code.

Code that access to the required information upon my input

int n, numTuples; 
unsigned long long int y;
clock_t t1, t2;

cout<< "Write the ID of the row you want to get more information: " ;
cin>>n; // We get the row to be represented -> "n"

cout<< "Write the number of followed tuples to be queried: " ;
cin>>numTuples; // We get the number of followed tuples to be queried-> "numTuples"

cout<<"Write the ID of the 'columns' you want to get more information: ";
cin>>y; // We get the "columns" to be represented ' "y"

unsigned int r; // Auxiliar variable for the columns path
int t=0; // Auxiliar variable for the tuples path
int idTable;

vector<int> columnsToBeQueried; // Here we will store the columns to be queried get from the bitset<500> binarynumber, after comparing with a mask
vector<string> shownVector; // Vector to store the final information from the query
bitset<500> mask;
mask=0x1;

t1=clock(); // Start of the query time

bitset<500> binaryNumber = Utilities().getDecToBin(y); // We get the columns -> change number from decimal to binary. Max number of columns: 5000

// We see which columns will be queried
for(r=0;r<binaryNumber.size();r++) //
{               
    if(binaryNumber.test(r) & mask.test(r))  // if both of them are bit "1"
    {
        columnsToBeQueried.push_back(r);
    }
    mask=mask<<1;   
}

do
{
    for(int z=0;z<columnsToBeQueried.size();z++)
    {
        int i;
        i=columnsToBeQueried.at(z);

        vector<int> colTab;
        colTab.push_back(1); // Don't really worry about this

        //idTable = colTab.at(i);   // We identify in which table (with the id) is column_i
        // In this simple example we only have one table, so don't worry about this

        const Table& selectedTable = database.getPointer().at(0); // It simmulates a vector with pointers to different tables that compose the database, but our example database only have one table, so don't worry            ElementSet selectedElementSet;

        ElementSet selectedElementSet;

        selectedElementSet=selectedTable.getRowsCols().at(n);
        shownVector.push_back(selectedElementSet.getElements().at(i)); // We save in the vector shownVector the element "i" of the row "n"

    }   
    n=n+1;
    t++;            

}while(t<numTuples);

t2=clock(); // End of the query time

float diff ((float)t2-(float)t1);
float microseconds = diff / CLOCKS_PER_SEC*1000000;

cout<<"The query time is: "<<microseconds<<" microseconds."<<endl;

Class definitions

Here I attached some of the class definitions so that you can compile the code, and understand better how it works:

class ElementSet
{
private:
    int id;
    vector<string> elements; 

public:
    ElementSet(); 
    ElementSet(int, vector<string>); 

    const int& getId();
    void setId(int);

    const vector<string>& getElements();
    void setElements(vector<string>);

};

class Table
{
private:
    vector<ElementSet> RowsCols; 

public:
    Table(); 
    Table(vector<ElementSet>); 

    const vector<ElementSet>& getRowsCols();
    void setRowsCols(vector<ElementSet>);
};


class DataBase
{
     private:
        int id;
        vector<Table> pointer; 

     public:
        DataBase(); 
        DataBase(int, vector<Table>); 

    const int& getId();
    void setId(int);

    const vector<Table>& getPointer();
    void setPointer(vector<Table>);

    };

class Utilities
{
        public:
        Utilities();
        static bitset<500> getDecToBin(unsigned long long int);
};

So the problem that I get is that my query time is very different depending on the table size (it has nothing to do a table with 100 rows and 100 columns, and a table with 10000 rows and 1000 columns). This makes that my code performance is very low for big tables, what really matters me... Do you have any idea how could I optimizate my code????

Thank you very much for all your help!!! :)

thomas
  • 151
  • 1
  • 1
  • 9
  • 1
    TLDR ;) - Why can't you just use [SQlite](http://www.sqlite.org/)? – Björn Pollex Mar 22 '11 at 09:30
  • kudos to you for making your own database. If you want to make it faster use an index. The index is kept in a separate file and you update it when the rows are updated. keywords to look up B+-tree or B-tree to start with. (If you want some history you're writing DBase - late 80's I think.) – daven11 Mar 22 '11 at 11:05

5 Answers5

3

Whenever you have performance problems, the first thing you want to do is to profile your code. Here is a list of free tools that can do that on windows, and here for linux. Profile your code, identify the bottlenecks, and then come back and ask a specific question.

Also, like I said in my comment, can't you just use SQLite? It supports in-memory databases, making it suitable for testing, and it is lightweight and fast.

Community
  • 1
  • 1
Björn Pollex
  • 75,346
  • 28
  • 201
  • 283
  • I have been asked to do in this way. Actually I didn't know about C++ and I had to learn to do this... I will try to use profilers to see bottlenecks. Anyway, if you have time and want to have a look, I will appreciate very much your help! Thanks in advance Space_C0wb0y – thomas Mar 22 '11 at 10:16
  • @thomas: Who asked you that? Did they give any reasons. Seriously, if there are no good reasons not to use some existing database, you must convince them that this requirement is nonsense! Writing your own "database simulator" is time-consuming and error-prone and will most likely end up costing your Boss a lot of money. – Björn Pollex Mar 22 '11 at 10:33
  • I was asked that in the university. I just have to make some time experiments depending on the number of rows and column with a code that simulates a database – thomas Mar 22 '11 at 10:38
1

One obvious issue is that your get-functions return vectors by value. Do you need to have a fresh copy each time? Probably not.

If you try to return a const reference instead, you can avoid a lot of copies:

const vector<Table>& getPointer();

and similar for the nested get's.

Bo Persson
  • 90,663
  • 31
  • 146
  • 203
  • @Bo Persson Thanks Bo!! It's true! I dont need to return vectors by value, with the reference is enough and in that way I don't have to make copies of every vector, what make my code runs slower. So thanks for the suggestion. I will update my question with this improvement. But the performance is not good yet, do you have any other suggestion??? :D – thomas Mar 22 '11 at 10:25
  • You also have to change the other code to store a reference 'const Table& selectedTable = database.getPointer().at(0); ´, otherwise the assignment will copy from the reference anyway. And then add a few other `const`s to the code, like in class Table `const vector& getRowsCols() const; `. Etc! – Bo Persson Mar 22 '11 at 10:34
  • @Bo Persson. Interesting!! You are making my day! Thanks a lot! :) Let me see how it works with these changes and I will tell you... by the way, why do I have to define all the get's functions as const?? – thomas Mar 22 '11 at 10:42
  • The functions have to be const, if you use them from a const reference. – Bo Persson Mar 22 '11 at 10:45
  • @Bo Persson. Already updated! Now the program works extremely fast... I am a bit worried, I want to make some checkings to see if it works as I intended. Another question: when I indicate selectedElementSet=selectedTable.getRowsCols().at(n); how must I indicate that the value is return by reference?? – thomas Mar 22 '11 at 10:57
  • @Thomas: You can't do that there. :-) You have to bind it to a const reference, just like with `const Table&`, otherwise the assignment will create a copy from the referenced object. A reference is just like another name for an object, an alias. If you copy or assign from the reference, the original object is copied. But if you create another reference, you get a new alias for the same object, without any copying. – Bo Persson Mar 22 '11 at 11:56
  • @Bo Persson. Hey friend! I have continued with my program and it works better. But now I have another strange problems, and I would like to ask for your help, cause it was ver useful. I send you the link of my new post. I would be very grateful if you could help me again with any of your advices. Thanks in advance! http://stackoverflow.com/questions/5479013/optimizating-my-code-simulating-a-database-2 – thomas Apr 01 '11 at 12:06
0

I have not done the job, but you may analyse the complexity of your algorithm. The reference says that access an item is in constant time, but when you create loops, the complexity of your program increases:

for (i=0;i<1000; ++i) // O(i)
  for (j=0;j<1000; ++j) // O(j)
     myAction(); // Constant in your case

The program complexity is O(i*j), so how big may be i an j? What if myAction is not constant in time?

Aif
  • 11,015
  • 1
  • 30
  • 44
  • Sorry Aif, I don't really understand what you mean... can you explain with another words or with more detail?? Thanks :) – thomas Mar 22 '11 at 10:31
0

No need to reinvent the wheel again, use FirebirdSQL embedded database instead. That combined with IBPP C++ interface gives you a great foundation for any future needs.

http://www.firebirdsql.org/

http://www.ibpp.org/

ROAR
  • 1,304
  • 1
  • 11
  • 28
0

Though I advise you to please use a profiler to find out which parts of your code are worth optimizing, here is how I would write your program:

Read the entire text file into one string (or better, memory-map the file.) Scan the string once to find all | and \n (newline) characters. The result of this scan is an array of byte offsets into the string.

When the user then queries item M of row N, retrieve it with code something like this:

char* begin = text+offset[N*items+M]+1; 
char* end = text+offset[N*items+M+1];

If you know the number of records and fields before the data is read, the array of byte offsets can be a std::vector. If you don't know and must infer from the data, it should be a std::deque. This is to minimize costly memory allocation and deallocation, which I imagine is the bottleneck in such a program.