What is the most efficient way to split a line of a text in C++?

Question

I'm dealing with some text files in which I need to read every line and extract the strings in each line. I used an approach like the one below (assuming there are 4 strings in each line):

string word1, word2, word3, word4, line;
while( getline( inputFile,line )){

    stringstream row(line);
    row>>word1>>word2>>word3>>word4;

}

However, it turned out to be very inefficient; my program did not run very fast. How can I improve the method? Thanks in advance!

There's a bunch of solutions there: http://stackoverflow.com/questions/1120140/how-can-i-read-and-parse-csv-files-in-c — didierc, Jul 19 '14 at 09:48
don't read the line with getline/stringstream but use the istream directly — wonko realtime, Jul 19 '14 at 09:49
"However, it turned out very inefficient" - don't be offended when I question that. With four "strings" per line and a buffered stream that code will be near-epsilon for speed. I have a 1-million line txt file, each line containing four random strings ranging in length from 5 to 30 characters, and the above code will enumerate the entire thing in about 3 seconds (on a 3-year-old macbook air laptop, no-less). Either your problem is something with the data you're *not* sharing or it isn't in this code. A release build of the posted code should rip through a file. — WhozCraig, Jul 19 '14 at 12:00
This code was just an example; that's why I said "assume 4 strings in each line". Plus, that wasn't the whole program. As the title summarizes, I want to learn the most efficient way to read strings from the lines of a .txt file. Therefore I posted only the part that does the read operation — Baturay Kaya, Jul 19 '14 at 14:52
I suspected it wasn't the whole program, and that was somewhat my point. *None* of the answers below (*yet*, anyway) do what those lines of code posted do. (prohibit line-bleed, platform newline processing, etc). Could you write a hundred lines of code that circumvents the line-buffer and word-buffer allocations and copies, based on a complicated `strtok()` + `memcpy()` algorithm? Certainly. Is it worth it in the end? ask Donald Knuth. That said [**see this**](http://stackoverflow.com/a/15116163/1322972). Mats does a *fabulous* dissection of tradeoffs between complexity and performance. — WhozCraig, Jul 19 '14 at 17:36

score 0 · Answer 1 · edited May 23 '17 at 10:33

Don't use getline and stringstream. Read the whole file in large chunks/blocks of data using the read function:

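// ios::ate opens the file positioned at its end
ifstream file("file.txt", ios::in | ios::binary | ios::ate);
if (file.is_open())
{
    // already at the end, so tellg() gives the file size;
    // streamsize avoids truncating large sizes to int
    const streamsize block_size = file.tellg();
    char *contents = new char[block_size + 1];  // +1 for a terminating '\0'
    file.seekg(0, ios::beg);
    file.read(contents, block_size);
    contents[file.gcount()] = '\0';  // terminate after the bytes actually read
    file.close();

    // ... now deal with the string (I/O operations take the most time; once
    // the entire file is in RAM it will be faster to operate on)

    delete[] contents;
}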

If the file size exceeds the available heap memory, you will have to read it in blocks of a predefined size: process each block, free the memory, and move on to the next block.

Suggestion

score 0 · Answer 2 · answered Jul 19 '14 at 10:41

I see two variants. I compared all three variants (yours and my two) on a file generated like this:

(bash)for ((i=0;i<100000;++i)); do echo "$i $i $i $i"; done > test.txt

test.txt was placed in tmpfs. All timings are in seconds.

Your variant: CPU time 0.130000, abs time 0.135514

My variant 1: CPU time 0.060000, abs time 0.062909

My variant 2: CPU time 0.050000, abs time 0.052963

1)"C mode":

//FILE *in  
char buf[1000];
buf[sizeof(buf) - 1] = '\0';
char w1[sizeof(buf)];
char w2[sizeof(buf)];
char w3[sizeof(buf)];
char w4[sizeof(buf)];
while (fgets(buf, sizeof(buf) - 1, in) != nullptr) {
    // sscanf returns the number of words converted; skip malformed lines
    if (sscanf(buf, "%s %s %s %s", w1, w2, w3, w4) != 4)
        continue;
    //words.emplace_back(std::string(w1), std::string(w2), std::string(w3), std::string(w4));
}

2)"mapped file":

//MapFile in;
const char *beg = in.begin();
const char *end = beg + file_size;
std::string w[4];
const char *ptr = beg;
bool eof = false;
do {
    for (int i = 0; i < 4; ++i) {
        const char *q = find_end_of_word(ptr, end);
        w[i].assign(ptr, q - ptr);
        if (q == end) {
            eof = true;
            break;
        }
        ptr = q;
        while (ptr != end && (*ptr == ' ' || *ptr == '\t' || *ptr == '\n'))
            ++ptr;
        if (ptr == end) {
            eof = true;
            break;
        }
    }
    //words.emplace_back(w[0], w[1], w[2], w[3]);
    //printf("%s %s %s %s\n", w[0].c_str(), w[1].c_str(), w[2].c_str(), w[3].c_str());
} while (!eof);
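The "mapped file" variant calls `find_end_of_word` without showing it. The sketch below is one plausible definition, assuming words are delimited by the same three characters the loop skips (space, tab, newline). It is a guess at the helper's behavior, not the answer author's code. `split_words` is a hypothetical driver that works on any in-memory buffer, mapped or not.

```cpp
#include <string>
#include <vector>

// Assumed definition (the answer does not show this helper): returns a
// pointer to the first whitespace character in [ptr, end), or `end` if
// the word runs to the end of the buffer.
inline const char* find_end_of_word(const char* ptr, const char* end) {
    while (ptr != end && *ptr != ' ' && *ptr != '\t' && *ptr != '\n')
        ++ptr;
    return ptr;
}

// Hypothetical driver: collects every word found in [beg, end).
std::vector<std::string> split_words(const char* beg, const char* end) {
    std::vector<std::string> words;
    const char* ptr = beg;
    while (true) {
        // skip the separators in front of the next word
        while (ptr != end && (*ptr == ' ' || *ptr == '\t' || *ptr == '\n'))
            ++ptr;
        if (ptr == end) break;
        const char* q = find_end_of_word(ptr, end);
        words.emplace_back(ptr, q);  // copies the characters in [ptr, q)
        ptr = q;
    }
    return words;
}
```

Files with Windows line endings would also need '\r' added to both character tests.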
