(Qt with C++)Reading a file(~25Mb) and comparing strings turn out to be slower than python

Question

I have two files, "test" & "sample". each file contains "rs-numbers" followed by "genotypes".

The test file is smaller than the sample file. only have around 150 rs-numbers+their genotypes.

However, the sample file contains more than 900k rs-numbers+their genotypes.

readTest() opens "test.tsv", read through the file by line, and returns a vector of tuples. the tuple holds (rs-number, geno type).

analyze() takes the result from readTest(), opens the sample file, read the file by line, and then do the comparison.

example from the sample file:

rs12124811\t1\t776546\tAA\r\n

rs11240777\t1\t798959\tGG\r\n

rs12124811 and rs11240777 are the rs-numbers. AA and GG are their genotypes.

the running time would be n*m. In my c++ version of this program, it takes 30 seconds while python version only takes 15 seconds and 5 seconds with multiprocessing.

vector<tuple<QString, QString>> readTest(string test){
// readTest can be done instantly
return tar_gene;
}

// tar_gene is the result from readTest()
// now call analyze. it reads through the sample.txt by line and does 
// comparison.
QString analyze(string sample_name,
         vector<tuple<QString, QString>> tar_gene
         ){

QString data_matches;

QFile file(QString::fromStdString(sample_name));
file.open(QIODevice::ReadOnly);
//skip first 20 lines
for(int i= 0; i < 20; i++){
    file.readLine();
}

while(!file.atEnd()){ // O(m)
    const QByteArray line = file.readLine();
    const QList<QByteArray> tokens = line.split('\t');
    // tar_gene is the result from readTest()
    for (auto i: tar_gene){ // O(n*m)
        // check if two rs-numbers are matched
        if (get<0>(i) == tokens[0]){
            QString i_rs = get<0>(i);
            QString i_geno = get<1>(i);
            QByteArray cur_geno = tokens[3].split('\r')[0];
            // check if their genotypes are matched
            if(cur_geno.length() == 2){
                if (i_geno == cur_geno.at(0) || i_geno == cur_geno.at(1)){
                    data_matches += i_rs + '-' + i_geno + '\n';
                    break; // rs-numbers are unique. we can safely break 
                           // the for loop
                }
            }
            // check if their genotypes are matched
            else if (cur_geno.length() == 1) {
                if (i_geno == cur_geno.at(0)){
                    data_matches += i_rs + '-' + i_geno + '\n';
                    break; // rs-numbers are unique. we can safely break 
                           // the for loop
                }
            }
        }
    }
}
return data_matches; // QString data_matches will be used in main() and 
                     // printed out in text browser
}

Here is the full source code

#include "mainwindow.h"
#include "ui_mainwindow.h"

MainWindow::MainWindow(QWidget *parent) :
    QMainWindow(parent),
    ui(new Ui::MainWindow)
{
    ui->setupUi(this);
}

MainWindow::~MainWindow()
{
    delete ui;
}

QString analyze(string sample_name,
             vector<tuple<QString, QString>> tar_gene,
             int start, int end){

    QString rs_matches, data_matches;

    QFile file(QString::fromStdString(sample_name));
    file.open(QIODevice::ReadOnly);
    //skip first 20 lines
    for(int i= 0; i < 20; i++){
        file.readLine();
    }

    while(!file.atEnd()){
        const QByteArray line = file.readLine();
        const QList<QByteArray> tokens = line.split('\t');
        for (auto i: tar_gene){
            if (get<0>(i) == tokens[0]){
                QString i_rs = get<0>(i);
                QString i_geno = get<1>(i);
                QByteArray cur_geno = tokens[3].split('\r')[0];
                if(cur_geno.length() == 2){
                    if (i_geno == cur_geno.at(0) || i_geno == cur_geno.at(1)){
                        data_matches += i_rs + '-' + i_geno + '\n';
                        break;
                    }
                }
                else if (cur_geno.length() == 1) {
                    if (i_geno == cur_geno.at(0)){
                        data_matches += i_rs + '-' + i_geno + '\n';
                        break;
                    }
                }
            }
        }
    }
    return data_matches;
}

vector<tuple<QString, QString>> readTest(string test){
    vector<tuple<QString, QString>> tar_gene;
    QFile file(QString::fromStdString(test));
    file.open(QIODevice::ReadOnly);
    file.readLine(); // skip first line
    while(!file.atEnd()){
        QString line = file.readLine();
        QStringList templist;
        templist.append(line.split('\t')[20].split('-'));
        tar_gene.push_back(make_tuple(templist.at(0),
                                      templist.at(1)));
    }
    return tar_gene;
}

void MainWindow::on_pushButton_analyze_clicked()
{
    if(ui->comboBox_sample->currentIndex() == 0){
        ui->textBrowser_rs->setText("Select a sample.");
        return;
    }
    if(ui->comboBox_test->currentIndex() == 0){
        ui->textBrowser_rs->setText("Select a test.");
        return;
    }
    string sample = (ui->comboBox_sample->currentText().toStdString()) + ".txt";
    string test = ui->comboBox_test->currentText().toStdString() + ".tsv";

    vector<tuple<QString, QString>> tar_gene;

    QFile file_test(QString::fromStdString(test));
    if (!file_test.exists()) {
        ui->textBrowser_rs->setText("The test file doesn't exist.");
        return;
    }

    tar_gene = readTest(test);

    QFile file_sample(QString::fromStdString(sample));
    if (!file_sample.exists()) {
        ui->textBrowser_rs->setText("The sample file doesn't exist.");
        return;
    }
    clock_t t1,t2;

    t1=clock();
    QString result = analyze(sample, tar_gene, 0, 0);
    t2=clock();
    float diff ((float)t2-(float)t1);

    float seconds = diff / CLOCKS_PER_SEC;
    qDebug() << seconds;
    ui->textBrowser_rsgeno->setText(result);
}

How do I make it run faster? I remade my program in c++ because I expected to see a better performance than python version!

with @Felix helps, my program takes 15 seconds now. I will try multithreading later.

here are the examples of the source data files:

(test.tsv) rs17760268-C rs10439884-A rs4911642-C rs157640-G ... and more. They are not sorted.

(sample.txt) rs12124811\t1\t776546\tAA\r\n rs11240777\t1\t798959\tGG\r\n ... and more. They are not sorted.

any recommendation for better data structure or algorithms?

To be able to say what can be improved we need to see your program, not just a part of it. — Slava, Jan 09 '19 at 21:13
@NathanOliver How do I do it? I am new to qt and c++. I always run the test by pressing "ctrl + r" in qt creator — nkforce, Jan 09 '19 at 21:19
You should start optimization by replacing `n*m` by something better. And you probably can do that in python as well. — Slava, Jan 09 '19 at 21:31
@Slava I updated some examples of the source data files. would you give me some recommendation for better data structure or algorithms? — nkforce, Jan 09 '19 at 21:57
Sure, but you need to describe what you are trying to achieve first. I am not going to reverse engineer your solution for various reasons. — Slava, Jan 10 '19 at 01:57

Felix · Accepted Answer · 2019-01-09T21:18:21.217

2

There are actually a bunch of things you can do to optimize this code.

Make shure the Qt you are using was compiled optimized for speed, not for size
Build you application in release mode. Make shure you set the correct compiler and linker flags to optimize for speed, not for size
Try to use references more. This avoids unneccessary copy operations. For example, use for(const auto &i : tar_gene). There are many more instances. Basically, try to avoid anything that is not a reference as much as possible. This also means to use std::move and rvalue references wherever possible.
Enable the QStringBuilder. It will optimize concatenation of strings. To do so, add DEFINES += QT_USE_QSTRINGBUILDER to your pro file.
Use either QString or QByteArray for your whole code. Mixing them means that Qt has to convert them everytime you compare them.

These are just the most basic and simplest things you can do. Try them and see how much speed you can gain. If it's still not enough, you can either try to further optimize the mathematical algorithm you are implementing here or look deeper into C/C++ to learn all the small tricks you can do for more speed.

EDIT: You can also try to gain speed by splitting the code up on multiple threads. Doing this by hand is not a good idea - have a look at QtConcurrent if you are interested in that.

edited Jan 09 '19 at 21:18

answered Jan 09 '19 at 21:15

Felix

6,885
1
29
54

First and foremost optimization is to look into data structure and algorithms. What you mention are mostly microoptimizations. – Slava Jan 09 '19 at 21:16
You are right - optimizing the actual structures would propably have a greater effect, but one should not neglect those, as many small things can have a big effect, especially when they happen often – Felix Jan 09 '19 at 21:31
omg. I tried "3" and now it only takes 15 seconds!! Thanks a lot! I will google and see why this causes the problem. – nkforce Jan 09 '19 at 21:33
can you explain more about 1 & 2 ? I just googled and found this post: [link](https://stackoverflow.com/questions/12785364/configuration-for-optimizing-qtcreator-compiler/12867971#12867971) does this post relate to 1&2? – nkforce Jan 09 '19 at 21:35
It depends on what compiler you are using. For gcc those are indeed the flags you need. And the problem solved by 3 is that using values means that every value is copied from the vector. Using references instead does not copy anthing. – Felix Jan 09 '19 at 21:57
Try to read some books/post about "references" in c++, that should give you an idea what you are dealing with. – Felix Jan 09 '19 at 21:58
@Felix I updated my post with some examples of the source data files. Could you give me some recommendation for better data structure or algorithms? – nkforce Jan 09 '19 at 22:29

(Qt with C++)Reading a file(~25Mb) and comparing strings turn out to be slower than python

1 Answers1