I was reading sehe's answer about fast text-file reading in C++, which looks like this:
```cpp
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16 * 1024;

    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    // read() returns ssize_t: -1 on error, 0 on EOF, otherwise bytes read.
    while (ssize_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if (bytes_read == -1)
            handle_error("read failed");
        for (char *p = buf; (p = (char *)memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    close(fd);
    return lines;
}
```
This is cool, but I was wondering whether a similar approach can be taken when we aren't dealing with a per-character operation like counting newlines, but instead want to operate on each line of data. Say, for instance, I had a file of doubles and already had some function `parse_line_to_double` to call on each line:

```
12.44243
4242.910
...
```
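
(The parser itself isn't important here; assume something minimal like this sketch, where each call gets exactly one NUL-terminated line.)

```cpp
#include <cstdlib>

// Placeholder for the real parser; the actual function could do anything.
static double parse_line_to_double(char const *line)
{
    return std::strtod(line, nullptr);
}
```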
That is, how can I read `BUFFER_SIZE` bytes into my buffer while avoiding splitting the last line read? Effectively, can I ask "give me `BUFFER_SIZE` or fewer bytes while ensuring that the last byte read is a newline character (or EOF)"?
I know extremely little about low-level I/O like this, but a couple of ideas came to mind:

- Can I "back up" `fd` to the most recent newline between iterations?
- Do I have to keep a second buffer holding a copy of the current line being read all the time? (Something like the sketch after this list?)
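
To make the second idea concrete, here is a rough sketch of what I imagine (the name `for_each_line` and the `process_line` callback are mine, and I'm not at all sure this is the right pattern): any partial line left at the end of a chunk gets moved to the front of the buffer before the next `read` fills in the rest.

```cpp
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Sketch only: read in large chunks, but hand complete lines to a callback.
// NOTE: a single line longer than BUFFER_SIZE would fill the buffer with no
// newline in it; this sketch ignores that case.
template <typename LineFn>
static void for_each_line(char const *fname, LineFn process_line)
{
    static const size_t BUFFER_SIZE = 16 * 1024;
    char buf[BUFFER_SIZE + 1];
    size_t leftover = 0; // bytes of an unfinished line carried over

    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        return; // real code would report the error

    while (true)
    {
        ssize_t n = read(fd, buf + leftover, BUFFER_SIZE - leftover);
        if (n <= 0)
            break; // EOF or error; real code would distinguish the two

        size_t filled = leftover + (size_t)n;
        char *line_start = buf;
        char *nl;
        while ((nl = (char *)memchr(line_start, '\n', (buf + filled) - line_start)))
        {
            *nl = '\0'; // NUL-terminate so the callback sees one clean line
            process_line(line_start);
            line_start = nl + 1;
        }

        // Carry the trailing partial line to the front of the buffer.
        leftover = (buf + filled) - line_start;
        memmove(buf, line_start, leftover);
    }

    if (leftover) // the file didn't end with a newline
    {
        buf[leftover] = '\0';
        process_line(buf);
    }

    close(fd);
}
```

Usage would then be something like `for_each_line("doubles.txt", [&](char const *line) { values.push_back(parse_line_to_double(line)); });`. Is this carry-over approach what's normally done, or is there a better trick?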