
There is a large text file of 6.53 GiB. Each of its lines is either a data line or a comment line. Comment lines are usually short, fewer than 80 characters, while a data line is variable-length and contains more than 2 million characters.

Considering that each data line needs to be processed as a unit, is there a simple way to read lines safely and fast in C++?

safe (for variable-length data lines): The solution should be as easy to use as std::getline(). Since the line length varies, extra manual memory management should be avoided.

fast: The solution should be as fast as readline() in Python 3.6.0, or ideally as fast as fgets() of stdio.h.

A pure C solution is welcome; the interface for further processing is available in both C and C++.


UPDATE 1: Thanks to a short but invaluable comment from Basile Starynkevitch, the perfect solution came up: POSIX getline(). Since the further processing only converts characters to numbers and does not use many features of the string class, a char array is sufficient in this application.


UPDATE 2: Thanks to comments from Zulan and Galik, who both report comparable performance among std::getline(), fgets() and POSIX getline(), another possible solution is to use a better standard library implementation such as libstdc++. Moreover, here is a report claiming that the Visual C++ and libc++ implementations of std::getline are not well optimised.

Moving from libc++ to libstdc++ changes the results a lot. With libstdc++ 3.4.13 / Linux 2.6.32 on a different platform, POSIX getline(), std::getline() and fgets() show comparable performance. Initially, the code was run under the default settings of clang in Xcode 8.3.2 (8E2002), so libc++ was used.


More details and some efforts (very long):

getline() of <string> can handle arbitrarily long lines but is a bit slow. Is there an alternative in C++ to readline() in Python?

// benchmark on Mac OS X with libc++ and SSD:
readline() of python                         ~550 MiB/s

fgets() of stdio.h, -O0 / -O2               ~1100 MiB/s

getline() of string, -O0                      ~27 MiB/s
getline() of string, -O2                     ~150 MiB/s
getline() of string + stack buffer, -O2      ~150 MiB/s

getline() of ifstream, -O0 / -O2             ~240 MiB/s
read() of ifstream, -O2                      ~340 MiB/s

wc -l                                        ~670 MiB/s

cat data.txt | ./read-cin-unsync              ~20 MiB/s

getline() of stdio.h (POSIX.1-2008), -O0    ~1300 MiB/s
  • Speeds are rounded very roughly, only to show the order of magnitude, and every program was run several times to make sure the values are representative.

  • '-O0 / -O2' means the speeds are very similar at both optimization levels.

  • The code is shown below.


readline() of python

# readline.py

import time
import os

t_start = time.perf_counter()

fname = 'data.txt'
fin = open(fname, 'rt')

count = 0

while True:
    l = fin.readline()
    length = len(l)
    if length == 0:     # EOF
        break
    if length > 80:     # data line
        count += 1

fin.close()

t_end = time.perf_counter()
elapsed = t_end - t_start   # avoid shadowing the `time` module

fsize = os.path.getsize(fname)/1024/1024   # file size in MiB
print("speed: %d MiB/s" % (fsize/elapsed))
print("reads %d data lines" % count)

# run as `python readline.py` with python 3.6.0

fgets() of stdio.h

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

int main(int argc, char* argv[]){
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  FILE* fp = fopen(argv[1], "r");
  if(fp == NULL) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  char line[SIZE];

  int count = 0;
  while(fgets(line, SIZE, fp) == line) {
    if(strlen(line) > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

getline() of <string>

// readline-string-getline.cpp
#include <string>
#include <fstream>
#include <iostream>
#include <ctime>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  // manually set the buffer on stack
  const int BUFFERSIZE = 1024*1024*3;   // stack on my platform is 8 MiB
  char buffer[BUFFERSIZE];
  ifstream fin;
  fin.rdbuf()->pubsetbuf(buffer, BUFFERSIZE);
  fin.open(argv[1]);

  // default buffer setting
  // ifstream fin(argv[1]);

  if(!fin) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  string line;
  line.reserve(SIZE);

  int count = 0;
  while(getline(fin, line)) {
    if(line.size() > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

getline() of ifstream

// readline-ifstream-getline.cpp
#include <fstream>
#include <iostream>
#include <ctime>
#include <cstdlib>
#include <cstring>   // strlen()

using namespace std;

int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  ifstream fin(argv[1]);
  if(!fin) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  char line[SIZE];

  int count = 0;
  while(fin.getline(line, SIZE)) {
    if(strlen(line) > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

read() of ifstream

// seq-read-bin.cpp
// sequentially read the file to see the speed upper bound of
// ifstream

#include <iostream>
#include <fstream>
#include <ctime>
#include <cstdio>    // fprintf()
#include <cstdlib>   // EXIT_FAILURE / EXIT_SUCCESS

using namespace std;


int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  ifstream fin(argv[1], ios::binary);

  const int SIZE = 1024*1024*3;
  char str[SIZE];

  while(fin) {
    fin.read(str,SIZE);
  }

  clock_t t_end = clock();
  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  const double fsize = 6685;  // file size in MiB

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));

  return EXIT_SUCCESS;
}

use cat, then read from cin with cin.sync_with_stdio(false)

#include <iostream>
#include <string>    // std::string
#include <cstdio>    // fprintf()
#include <ctime>
#include <cstdlib>

using namespace std;

int main(void) {
  clock_t t_start = clock();

  string input_line;

  cin.sync_with_stdio(false);

  while(cin) {
    getline(cin, input_line);
  }

  double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;

  const double fsize = 6685;  // file size in MiB

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));

  return EXIT_SUCCESS;
}

POSIX getline()

// readline-c-getline.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {

  clock_t t_start = clock();

  char *line = NULL;
  size_t len = 0;
  ssize_t nread;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s <file>\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  FILE *stream = fopen(argv[1], "r");
  if (stream == NULL) {
    perror("fopen");
    exit(EXIT_FAILURE);
  }

  int count = 0;
  while ((nread = getline(&line, &len, stream)) != -1) {
    if (nread > 80) {
      count += 1;
    }
  }

  free(line);
  fclose(stream);

  double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;
  const double fsize = 6685;  // file size in MiB
  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines.\n", count);

  exit(EXIT_SUCCESS);
}
Eli4ph
    Did you check this out: https://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python?rq=1 ? – Rene May 29 '17 at 11:24
  • 1
    Assuming a Linux system, you should also benchmark [getline(3)](http://man7.org/linux/man-pages/man3/getline.3.html) – Basile Starynkevitch May 29 '17 at 11:24
  • Why did you not add the Brainfuck tag, too? – too honest for this site May 29 '17 at 11:30
  • 4
    @Olaf the tags you removed were absolutely relevant, in particular [performance]. There are clearly C++, Python, and (pure) C codes in the question. Please do not make questions more difficult to find just to satisfy your personal aversion against [C] and [C++] tagged questions. – Zulan May 29 '17 at 11:38
  • 1
    http://lemire.me/blog/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/ – James Poag May 29 '17 at 11:40
  • @Rene Yes, it did not help. I tried to `cat` the data into `stdin`, and after sync_with_stdio(false), the speed is still not so fast, about 30 MiB/s. – Eli4ph May 29 '17 at 11:45
  • @Zulan: The question is "Is there a simple way to read lines safe and fast in C++?" showing code in various languages does not help, but is completely useless. So no, the tags _were_ irrelevant. ("performance" is questionable, as OP asks for a "**simple** way", too which often contradicts "fast"). – too honest for this site May 29 '17 at 11:49
  • Did I miss something? What's wrong with `std::fgets()`? – Galik May 29 '17 at 11:54
  • @Galik: `std::gets` was deprecated in C++11 and removed in C++14. – Zulan May 29 '17 at 11:56
  • @Zulan my bad I meant `std::fgets` (edited) – Galik May 29 '17 at 11:57
  • @Olaf I am sorry to make you feel uncomfortable. But since Python can achieve the safety of `string` at the price of halving the speed of `fgets()`, I am wondering whether such an easy-to-use solution exists in C++. This really matters when dealing with large text files containing data. – Eli4ph May 29 '17 at 11:58
  • 1
    Also this should really only be tagged `C++` as no amount of expertise in `python` or `C` can tell you what a `C++` programmer should do. – Galik May 29 '17 at 11:59
  • @Eli4ph: This is not about feeling "uncomfortable", but site-rules. We are not a coding, tutoring or "find the resource" site. In case you have skipped it, take the [tour] and read [ask]. Also **please** stop spamming unrelated tags! – too honest for this site May 29 '17 at 12:00
  • 2
    Olaf was 100% correct here in removing the other language tags. We don't tag questions based on what code incidentally appears in the question. We tag them based on what types of answers are expected. If you don't want an answer in Python, the Python tag is inappropriate. This is clearly a C++ question, and you want an answer in C++. You wouldn't tag it [pseudocode] just because you happened to include some pseudo-code in the question, would you? – Cody Gray - on strike May 29 '17 at 12:06
  • Moving on from a discussion of tagging (take it to chat or meta)... How variable are the long lines? Is it that some are 100k and others are 2500k, or is it 2.1M, 2.0M, 1.9M?? – Martin Bonner supports Monica May 29 '17 at 12:12
  • What operating system are you targeting? For both speed and simplicity, it would be worth looking into using memory-mapped files. Also, looking at your benchmarks, it is very strange that you say "-O0 / -O2". How can the speed be identical in both optimized and unoptimized builds? – Cody Gray - on strike May 29 '17 at 12:14
  • 1
    If there's a Python solution that takes 0.1 second, and the best C++ solution the OP can come up with takes 0.2 seconds, then he must be using C++ inefficiently or missing a trick. – Malcolm McLean May 29 '17 at 12:14
  • @CodyGray I don't understand either. Maybe for `fgets()` the bottleneck is the disk or the code is too simple to be optimized? – Eli4ph May 29 '17 at 12:17
  • @MartinBonner The length cannot be predicted. The only thing I am sure of is that the lines are less than 100 million characters long. Maybe it is impractical to ask for a solution for variable-length data lines that is simple and fast yet safe. – Eli4ph May 29 '17 at 12:35
  • Simple option: Read into a buffer, if not large enough, double the buffer size and copy the already read part. Remember how large a buffer you need, and start with that next time. More complicated option: Read into ½M chunks, and then reassemble into a single string when complete. Really complicated option: Write your own immutable string-like data structure, which owns 100K chunks and read into that. The data structure can then be cheap to copy. – Martin Bonner supports Monica May 29 '17 at 12:42
  • How is `std::getline` liable to cause "overflow"? – Galik May 29 '17 at 12:48
  • @Galik It will not, but it's not fast enough. – Eli4ph May 29 '17 at 13:15
  • @Eli4ph Its just that in your question you say that it can. – Galik May 29 '17 at 13:28
  • @Galik Please forgive my bad grammar. It is updated. – Eli4ph May 29 '17 at 13:34
  • 1
    What stdlib implementation and operating system do you use? With libstdc++ 3.4.22 / Linux 4.11.2 I get the following performance for a cached file: seq-read-bin: ~5000, ifstream-getline: ~3200, readlinec: ~2800, string-getline: ~2500, python ~800. Maybe the answer is to just use a better standard library implementation. BTW: Your time calculation is wrongly rounded. Do something like `((double)t_end-t_start)/CLOCKS_PER_SEC;`. – Zulan May 29 '17 at 14:13
  • 1
    Also: Have you considered just extracting the numbers from the file directly instead of extracting the whole line first? – Zulan May 29 '17 at 14:14
  • @BasileStarynkevitch Thank you very much! This is what I am seeking for. Before I post, I searched a lot but failed to find such a solution being simple and perfect for my application. In case other folks have similar requirement, could you do me a favor to add the comment as an answer? – Eli4ph May 29 '17 at 14:15
  • @Zulan Yes, I thought the rounding was not so important and so was not careful about it, but for fast functions like `fgets()` it changes a lot. I will update the results soon. Also, many thanks for your advice; I will try to implement your idea to see whether it is better for my application. – Eli4ph May 29 '17 at 14:20
  • @Zulan I checked CLOCKS_PER_SEC; it is 1 million on my platform, which I think is also standardized. Thus the results should not change very much, so I will be a little lazy and only change the code, assuming the output will stay the same. – Eli4ph May 29 '17 at 14:38
  • 1
    The issue is that `clock_t` and `CLOCKS_PER_SEC` are likely integrals, so you can only get a truncated integral number of seconds in `time`. – Zulan May 29 '17 at 14:50
  • 2
    I have to say, using comparable data on my system, `std::fgets` takes exactly the same time as `std::getline`. – Galik May 29 '17 at 14:53
  • @Zulan Oh, you are right. Then I have to rerun all the code. LOL – Eli4ph May 29 '17 at 14:55
  • 1
    Also `POSIX getline()` takes the same time on my system using a `20GB` file of random sized lines with a max line size of `1024 * 1024 * 3`. – Galik May 29 '17 at 15:02
  • @Zulan Also: I am using clang's default settings and am not familiar with specifying other options when compiling. I think the code is working with libc++ / darwin 16.6.0. I don't know how to get the version of libc++. – Eli4ph May 29 '17 at 15:29
  • @Galik @Zulan Thanks a lot. After moving from libc++ to libstdc++, `std::fgets`, `string getline()` and `POSIX getline()` show comparable performance. – Eli4ph May 30 '17 at 05:07
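Martin Bonner's simple option above (grow the buffer whenever a line does not fit) can be sketched roughly as follows. This is only an illustration: the function name, the initial size, and the growth policy are not from the comment.

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

// Read one full line into buf, doubling the buffer whenever the line does
// not fit. Returns the line length (including '\n' if present), or -1 at EOF.
long read_full_line(std::FILE* fp, std::vector<char>& buf) {
    if (buf.empty()) buf.resize(1024);
    std::size_t used = 0;
    while (std::fgets(buf.data() + used, (int)(buf.size() - used), fp)) {
        used += std::strlen(buf.data() + used);
        if (used > 0 && buf[used - 1] == '\n')
            return (long)used;              // got a complete line
        buf.resize(buf.size() * 2);         // line did not fit: grow and retry
    }
    return used > 0 ? (long)used : -1;      // trailing line without '\n', or EOF
}
```

Since the caller keeps `buf` alive between calls, the buffer only grows a few times at the start and then settles at the longest line seen so far, as the comment suggests.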
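Zulan's suggestion of extracting the numbers directly, without materialising each multi-megabyte line first, could look like the sketch below. The function name and file layout are assumptions; it only works if comment lines are filtered out beforehand or cannot parse as numbers.

```cpp
#include <fstream>
#include <vector>

// Sketch: let operator>> tokenise the stream directly; it skips spaces
// and newlines alike, so no per-line buffer is ever needed.
std::vector<double> read_numbers(const char* path) {
    std::ifstream fin(path);
    std::vector<double> values;
    double v;
    while (fin >> v)
        values.push_back(v);
    return values;
}
```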

3 Answers


Well, the C standard library is a subset of the C++ standard library. From the n4296 draft of the C++14 standard:

17.2 The C standard library [library.c]

The C++ standard library also makes available the facilities of the C standard library, suitably adjusted to ensure static type safety.

So, provided you explain in a comment that a performance bottleneck requires it, it is perfectly fine to use fgets in a C++ program; you should simply encapsulate it carefully in a utility class, in order to preserve the high-level OO structure.
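For illustration, such a wrapper might look like the sketch below. The answer does not prescribe any particular interface; the class name and chunk size here are invented, and error handling is minimal.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// RAII ownership of the FILE*, fgets() underneath, growing a std::string
// so arbitrarily long lines remain safe to read.
class LineReader {
public:
    explicit LineReader(const char* path) : fp_(std::fopen(path, "r")) {
        if (!fp_) throw std::runtime_error("cannot open file");
    }
    ~LineReader() { std::fclose(fp_); }
    LineReader(const LineReader&) = delete;
    LineReader& operator=(const LineReader&) = delete;

    // Reads the next line (keeping its '\n' if present) into `line`.
    // Returns false at end of file.
    bool next(std::string& line) {
        line.clear();
        char chunk[65536];
        while (std::fgets(chunk, sizeof chunk, fp_)) {
            line += chunk;
            if (!line.empty() && line.back() == '\n')
                return true;
        }
        return !line.empty();   // last line may lack a trailing '\n'
    }

private:
    std::FILE* fp_;
};
```

The destructor closes the file even if an exception escapes the processing loop, which is the RAII point raised in the comments below.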

Serge Ballesta
  • I expected this post to be downvoted because it advises to use C library functions from C++, but I naively hoped I would get a comment... – Serge Ballesta May 29 '17 at 11:58
  • 1
    These libraries are also part of `C++` according to the standard so I don't understand the down vote. – Galik May 29 '17 at 12:01
  • 2
    The only part of this answer I find objectionable is the statement that *"you should carefully encapsulate it in an utility class, in order to preserve the OO high level structures."*. I don't see why it is necessary to wrap `fgets` in a class. C++ is a multi-paradigmatic language: not everything has to fit within the OOP paradigm. It isn't a strait-jacket. Not that there's anything especially object-oriented about a "utility" class anyway. You'd only bother with a wrapper if `fgets` could really benefit from object orientation, and I'm not sure how that would be. – Cody Gray - on strike May 29 '17 at 12:03
  • @CodyGray: I really prefer to encapsulate low-level optimisation to keep a clean high-level structure, be it in C or C++. As the OP asks for a C++ program, I assumed the high-level structure was object oriented. – Serge Ballesta May 29 '17 at 12:06
  • @CodyGray RAII! To quote Scott Meyers Effective C++: Use objects to manage resources - Resource acquisition is initialization (RAII) – Zulan May 29 '17 at 12:33
  • That's why the *stream* should be represented as an object, not why `fgets` should be modeled as an object. Not only does a stream have associated resources, but the corresponding file *is actually an object*. I suppose you would say that `fgets` would be a member function of the stream class, but it would work just as well as a free function. @zulan – Cody Gray - on strike May 29 '17 at 12:52

Yes, there's a faster way to read lines and create strings.

Query the file size, then load the file into a buffer. Then iterate over the buffer, replacing the newlines with NULs and storing a pointer to the start of each line.

It will be quite a bit faster if, as is likely, your platform has a call to load a file into memory.
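A minimal sketch of this split-in-place idea, given the file contents already slurped into a string (the function name is illustrative, and the actual file load is omitted):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Replace each '\n' with '\0' in place and remember where every line
// starts; the returned pointers stay valid as long as `data` lives.
std::vector<char*> split_lines(std::string& data) {
    std::vector<char*> lines;
    char* p = &data[0];
    char* end = p + data.size();
    while (p < end) {
        lines.push_back(p);
        char* nl = static_cast<char*>(std::memchr(p, '\n', end - p));
        if (!nl) break;        // last line has no trailing newline
        *nl = '\0';
        p = nl + 1;
    }
    return lines;
}
```

Each line is then a plain NUL-terminated C string, usable by the further number-parsing step without any copy.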

Malcolm McLean
  • 2
    It's not clear that will be faster. A friend "optimized" our overlay management code (a VAX port of Pr1me software - we are talking early 80's here) by using memory mapping. It wasn't that fast. He switched to QIOW to read from disk into a fixed bit of memory - it was much faster. Your approach will involve two passes through the data - that may be unacceptably expensive. – Martin Bonner supports Monica May 29 '17 at 12:16

As I commented, on Linux & POSIX systems you could consider using getline(3); I guess that the following could compile both as C and as C++ (assuming you have some valid fopen-ed FILE* fil; ...)

char* linbuf = NULL; /* or nullptr in C++ */
size_t linsiz = 0;
ssize_t linlen = 0;

while ((linlen = getline(&linbuf, &linsiz, fil)) >= 0) {
  /* do something useful with linbuf; but no C++ exceptions */
}
free(linbuf);
linsiz = 0;

I guess this might work (or be easily adapted) for C++. But then, beware of C++ exceptions: they should not propagate out of the while loop (or you should ensure that an appropriate destructor or catch does free(linbuf);).

Also, getline could fail (e.g. if an internal malloc fails), and you might need to handle that failure sensibly.
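One way to handle that, sketched under the assumption of a POSIX system (the function name is invented): getline() returns -1 both at end of file and on failure, but only a normal end of input sets the stream's end-of-file indicator, so feof() can tell the two apart.

```cpp
#include <cstdio>
#include <cstdlib>

// Count lines with POSIX getline(), distinguishing EOF from a failure
// such as ENOMEM raised by getline's internal allocation.
long count_lines(std::FILE* fil) {
    char* linbuf = NULL;
    size_t linsiz = 0;
    long count = 0;
    while (getline(&linbuf, &linsiz, fil) >= 0)
        ++count;
    free(linbuf);
    return std::feof(fil) ? count : -1;   // -1 signals a real error
}
```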

Basile Starynkevitch