
My program downloads files from a site (via curl, every 30 minutes). The size of these files can reach 150 MB.

So I suspect that getting data from these files is inefficient (I search for a line every 5 seconds).

These files can have ~10,000 lines.

To parse a file (values are separated by ",") I use this regex:

regex wzorzec("(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)");

There are 8 values.

Then I push the values into a vector:

allys.push_back({ std::stoi(std::string(wynik[1])), nick, tag, stoi(string(wynik[4])), stoi(string(wynik[5])), stoi(string(wynik[6])), stoi(string(wynik[7])), stoi(string(wynik[8])) });

I use std::async to do this, but for 3 files (~7 MB) the processor jumps to 80% and the operation takes about 10 seconds. I read from an SSD, so slow I/O is not the cause. I read the data line by line with fstream.

How can I speed this operation up? Should I parse these values and push them into SQL instead?

Best Regards

Thomas Banderas
  • See [How to split a string in C++?](http://stackoverflow.com/q/236129/33499) for alternatives of the regex to parse a line – wimh Aug 09 '14 at 08:48
  • Regular expressions can cause your processor to jump to 80%; do not use regex for this. – smali Aug 09 '14 at 08:52
  • Is your regex wrong? '.*' should match greedily in most regex parsing engines, and that will include the comma after it. – Siyuan Ren Aug 09 '14 at 09:14
  • @C.R. that regular expression looks OK, but all those greedy quantifiers will make it very inefficient. – Richard Aug 09 '14 at 09:45

2 Answers


You can probably get some performance boost by avoiding regex and using something along the lines of std::strtok, or else just hard-coding a search for commas in your data. Regex has more power than you need just to look for commas. Next, if you call vector::reserve before you begin a sequence of push_back calls on any given vector, you will save a lot of time on both reallocation and moving memory around. If you are expecting a large vector, reserve room for it up front.

This may not cover all available performance ideas, but I'd bet you will see an improvement.

Logicrat

Your problem here is most likely additional overhead introduced by the regular expression, since you're using many variable-length, greedy matches (the regex engine will try different alignments for the matches to find the largest matching result).

Instead, you might want to parse the lines manually. There are many different ways to achieve this. Here's one quick and dirty example (it's not flexible and has quite a bit of duplicate code, but there's lots of room for optimization). It should explain the basic idea though:

#include <iostream>
#include <sstream>
#include <cstdlib>

const char *input = "1,Mario,Stuff,4,5,6,7,8";

struct data {
    int id;
    std::string nick;
    std::string tag;
} myData;

int main(int argc, char **argv){
    char buffer[256];
    std::istringstream in(input);

    // Read an entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.id = atoi(buffer); // convert and store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.nick = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.tag = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Some test output
    std::cout << "id: " << myData.id << "\nnick: " << myData.nick << "\ntag: " << myData.tag << std::endl;
    return 0;
}

Note that there isn't any error handling in case entries are too long or too short (or broken in some other way).

Console output:

id: 1
nick: Mario
tag: Stuff
Mario