I was reading sehe's answer for fast text file reading in C++, which looks like this:

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern.  */
    posix_fadvise(fd, 0, 0, 1);  // FDADVICE_SEQUENTIAL

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if(bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}

This is cool, but I was wondering whether a similar approach can be taken when we aren't dealing with a character-level operation like counting newlines, but want to operate on each line of data. Say, for instance, I had a file of doubles and already had some function parse_line_to_double to use on each line:

12.44243
4242.910
...

That is, how can I read BUFFER_SIZE bytes into my buffer but avoid splitting the last line read? Effectively, can I ask "give me BUFFER_SIZE or fewer bytes while ensuring that the last byte read is a newline character (or EOF)"?

Knowing extremely little about low-level I/O like this, two ideas came to mind:

  • Can I "back up" fd to the most recent newline between iterations?
  • Do I have to keep a second buffer holding a copy of the current line being read all the time? (A rough sketch of this idea follows below.)
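
To make the second idea concrete, here is a rough, untested sketch of what I have in mind: carry the unfinished tail of each read in a std::string and prepend it to the next chunk. It assumes the handle_error helper from the quoted code and my hypothetical parse_line_to_double (taking a std::string here):

#include <fcntl.h>   // open
#include <unistd.h>  // read, close
#include <cstring>   // memchr
#include <cstdint>   // uintmax_t
#include <string>

static uintmax_t process_doubles(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1)
        handle_error("open");

    char buf[BUFFER_SIZE];
    std::string tail;  // partial line carried over from the previous read
    uintmax_t lines = 0;

    while(true)
    {
        ssize_t bytes_read = read(fd, buf, BUFFER_SIZE);
        if(bytes_read == -1)
            handle_error("read failed");
        if(bytes_read == 0)
            break;

        char *beg = buf;
        char *end = buf + bytes_read;
        while(char *nl = (char*) memchr(beg, '\n', end - beg))
        {
            tail.append(beg, nl);        // complete the current line
            parse_line_to_double(tail);  // hypothetical parser from above
            tail.clear();
            ++lines;
            beg = nl + 1;
        }
        tail.append(beg, end);  // stash whatever is left of the last line
    }

    if(!tail.empty())  // final line with no trailing newline
    {
        parse_line_to_double(tail);
        ++lines;
    }
    close(fd);
    return lines;
}

If something like this works, it would also answer the first idea: instead of seeking fd backwards, the leftover bytes simply stay in tail until the next read completes the line.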
Eric Hansen
  • If you need to operate on a line at a time, read a line at a time using `getline`. If you need to read a double at a time then perhaps formatted input would be better. – Retired Ninja Nov 24 '17 at 19:31 (this approach is sketched after the comments)
  • "That is, how can I read BUFFER_SIZE bytes into my buffer but avoid splitting the last line read? " You can't, really. You could Read a byte-read a byte-read a byte-byte-byte... until you find EOL, but this is stupid. Don't do it. Read into the buffer to get out of file-space and then parse the buffer. When you need more data in the buffer to finish a line, get more data from the file. How you deal with the data that was in the buffer, whole bunch of strategies to deal with that. I'm usually more interested in correctness than speed, so I'd use a `std::stringstream`. – user4581301 Nov 24 '17 at 19:38
  • @user4581301 Thanks for the info. Given that I'm a total beginner in this area, would you happen to know where I could be pointed to learn more about those different strategies? – Eric Hansen Nov 24 '17 at 19:54
  • Not off the top of my head, I can't. I'd start with a web search for "file buffering strategies" and see what comes up. Multi-buffering might be of interest to you. I've probably got the wrong name for it, but consider having a series of buffers, like a `std::deque`, so that you can present more than one buffer as one continuous flow of data in the event of one line being more than you can fit in one or two buffers. – user4581301 Nov 24 '17 at 20:01
  • That said, start with the stringstream, or just read your input as Retired Ninja suggests, and see if it's fast enough for you. It probably is. – user4581301 Nov 24 '17 at 20:03
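
For reference, the line-at-a-time approach Retired Ninja suggests might look roughly like this (a sketch; std::stod stands in for the asker's hypothetical parse_line_to_double, and it throws on lines that don't parse):

#include <fstream>
#include <string>
#include <cstdint>

// Sketch of the getline approach from the comments: the stream does the
// buffering, so no line is ever split across reads.
uintmax_t read_doubles_by_line(char const *fname)
{
    std::ifstream fin(fname);
    if(!fin) return 0;
    uintmax_t lines = 0;
    std::string line;
    while(std::getline(fin, line))
    {
        double value = std::stod(line); // throws on malformed input
        (void)value;                    // ...use the value here
        ++lines;
    }
    return lines;
}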

1 Answer

Here is a comparison test. First, let's try the easy way: just read the file with standard C++ stream functions:

#include <iostream>
#include <string>
#include <fstream>  // std::ifstream
#include <sstream>  // std::stringstream
#include <cstdint>  // uintmax_t

uintmax_t test1(char const *fname)
{
    std::ifstream fin(fname);
    if(!fin) return 0;
    uintmax_t lines = 0;
    double value;
    while(fin >> value)
    {
        //std::cout << value << "\n";
        lines++;
    }
    return lines;
}

Next, read the whole file into a std::stringstream first; this is about 2.5 times faster:

uintmax_t test2(char const *fname)
{
    std::ifstream fin(fname);
    if(!fin) return 0;
    uintmax_t lines = 0;
    double value;
    std::stringstream ss;
    ss << fin.rdbuf();
    while(ss >> value)
        lines++;
    return lines;
}

Next, let's read the whole file into memory. This will be fine as long as the file is less than 1 GiB or so. Assuming there is a double value on each line, let's extract that value. test3 is more complicated and less flexible, and it's not any faster than test2:

uintmax_t test3(char const *fname)
{
    std::ifstream fin(fname, std::ios::binary);
    if(!fin) return 0;

    fin.seekg(0, std::ios::end);
    size_t filesize = (size_t)fin.tellg();
    fin.seekg(0);
    std::string str(filesize, 0);
    fin.read(&str[0], filesize);

    double value;
    uintmax_t lines = 0;
    size_t beg = 0;
    size_t i;
    size_t len = str.size();
    for(i = 0; i < len; i++)
    {
        if(str[i] == '\n' || i == len - 1)
        {
            // include the final character when the file has no trailing '\n'
            size_t n = (str[i] == '\n') ? (i - beg) : (i - beg + 1);
            try
            {
                value = std::stod(str.substr(beg, n));
                //std::cout << value << "\n";
                lines++;
            }
            catch(...)
            {
                // ignore lines that fail to parse
            }
            beg = i + 1; // advance past this line even if parsing failed
        }
    }
    return lines;
}

For comparison with the wc function in the question, let's read the whole file into memory and only count the number of lines. This runs a little faster than wc (as expected), suggesting that there is no need for additional optimization:

uintmax_t test_countlines(char const *fname)
{
    std::ifstream fin(fname, std::ios::binary);
    if(!fin) return 0;
    fin.seekg(0, std::ios::end);
    size_t filesize = (size_t)fin.tellg();
    fin.seekg(0);
    std::string str(filesize, 0);
    fin.read(&str[0], filesize);
    uintmax_t lines = 0;
    for(auto &c : str)
        if(c == '\n')
            lines++;
    return lines;
}
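
For context, a minimal harness along these lines can run the comparison (a sketch; exact numbers depend on compiler, machine, and file):

#include <chrono>
#include <iostream>
#include <cstdint>

// Minimal timing harness (a sketch): run one of the tests above and
// report the elapsed wall-clock time together with the line count.
template <typename F>
void time_test(char const *name, F f, char const *fname)
{
    auto t0 = std::chrono::steady_clock::now();
    uintmax_t lines = f(fname);
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double> dt = t1 - t0;
    std::cout << name << ": " << lines << " lines in " << dt.count() << " s\n";
}

int main(int argc, char **argv)
{
    if(argc < 2) return 1;
    time_test("test1", test1, argv[1]);
    time_test("test2", test2, argv[1]);
    time_test("test3", test3, argv[1]);
    time_test("test_countlines", test_countlines, argv[1]);
    return 0;
}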
Barmak Shemirani