
I have a huge file, which I'm reading with fopen & fgetc in a loop.

It takes around 6 seconds to read the entire file with the "rb" flag in fopen; there are around 25k lines in the file.

I was wondering: what are faster ways than fgetc? Is it better to first load everything into a char* array? Is strcpy better?

  • Note that it's best if it works the way fgetc does, or at least lets me get at the data char by char in the array.

  • What are better ways than fgetc?

Raúl Sanpedro
  • [fread](http://pubs.opengroup.org/onlinepubs/009695399/functions/fread.html)? – Anton Savin Sep 24 '14 at 09:45
  • First of all, C or C++? – JBL Sep 24 '14 at 09:45
  • Title says C++, so C++ – Raúl Sanpedro Sep 24 '14 at 09:47
  • Generally, reading bigger chunks is better. Also take a look at `fgetc_unlocked`, if your implementation provides it. – Deduplicator Sep 24 '14 at 09:51
  • [`std::getc`](http://en.cppreference.com/w/cpp/io/c/fgetc) can be implemented as a macro (which I find funny considering it's namespace-qualified). This is intentional (the macro part); it has the benefit of potentially reducing the invoke overhead unless a buffer refill is required. Worth a try to bench it (in **release** build, of course). And I concur with Deduplicator: front-lock the file and don't forget to unlock it when finished. – WhozCraig Sep 24 '14 at 09:51
  • @WhozCraig: You are really absolutely sure it can be implemented as a macro (and not only as an inline function) in C++ too, and not only in C? – Deduplicator Sep 24 '14 at 09:53
  • Reading each character in a loop is going to be slower than reading the whole file into memory, or even a chunk at a time. So, are you reading the whole file, or just part of it, and do you need to read one character at a time? – TheDarkKnight Sep 24 '14 at 09:57
  • fgetc_unlocked & std::getc gave the same benchmark speed, 6 seconds. – Raúl Sanpedro Sep 24 '14 at 09:57
  • @RaúlSanpedro on a release build? (gotta ask). – WhozCraig Sep 24 '14 at 09:58
  • Perhaps this is what you really want: http://stackoverflow.com/questions/410943/reading-a-text-file-into-an-array-in-c – TheDarkKnight Sep 24 '14 at 10:00
  • @WhozCraig Yes, release build. – Raúl Sanpedro Sep 24 '14 at 10:02
  • @Merlin069 `char *bytes = malloc(pos);` fails with "a value of type void* cannot be used to initialize an entity of type char*" – Raúl Sanpedro Sep 24 '14 at 10:06
  • Well that's odd, because when I flockfile + getc() through EOF (saving in a contiguous memory buffer) + funlockfile on a 32MB disk file read in binary mode, I'm at 683ms (yes, it's an SSD). Removing the flockfile/funlockfile brackets pushes the same code up to 2360ms. And using `fgetc` only ratchets it up another 150ms or so, so I'm with Deduplicator on this (see the sketch after these comments). – WhozCraig Sep 24 '14 at 10:08
  • malloc returns a void*, so cast it to a char* with `char* bytes = (char*)malloc(pos);` – TheDarkKnight Sep 24 '14 at 10:10
  • @RaúlSanpedro if you're man-handling the entire file, at least use a `std::vector`. `malloc` has no place in a modern C++ program. – WhozCraig Sep 24 '14 at 10:11
  • @WhozCraig **flockfile** is unrecognized in my code; what header file is it in? – Raúl Sanpedro Sep 24 '14 at 10:13
  • [**See here**](http://man7.org/linux/man-pages/man3/flockfile.3.html) – WhozCraig Sep 24 '14 at 10:14
  • @WhozCraig having the buffer in a std::vector? Will it be faster than fgetc? – Raúl Sanpedro Sep 24 '14 at 10:14
  • The purpose of the vector-owned buffer is to [avoid pointers owning resources](https://dl.dropboxusercontent.com/u/6101039/Modern%20C%2B%2B.pdf). That's in the event you decide to bulk-load it. – WhozCraig Sep 24 '14 at 10:16
  • On POSIX-compliant systems, mmap() will achieve the "access character by character" requirement. Let the OS deal with memory allocation and I/O buffering; those are things it is good at. – Andrew Sep 24 '14 at 12:15
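
For completeness, here is a minimal sketch of the locked-read loop Deduplicator and WhozCraig describe in the comments. flockfile/funlockfile and getc_unlocked are POSIX extensions rather than standard C++, and the file name is a placeholder:

#include <stdio.h>   // flockfile, getc_unlocked, funlockfile are POSIX
#include <vector>

int main() {
    FILE *fp = fopen("bigfile.bin", "rb");       // hypothetical file name
    if (!fp) return 1;

    std::vector<char> bytes;
    flockfile(fp);                               // take the stream lock once up front
    for (int c; (c = getc_unlocked(fp)) != EOF; )
        bytes.push_back(static_cast<char>(c));   // per-call locking is skipped inside the bracket
    funlockfile(fp);                             // release the lock when finished

    fclose(fp);
    return 0;
}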

3 Answers


Well, fgetc is already pretty well optimized, because it uses the underlying buffering set up by fopen. You simply call a function (but not a system call) for each character. You could try to increase the buffer size (since you say you are reading huge files) with setbuffer:

#include <stdio.h>

#define SIZE 65536   /* or use an even greater size if appropriate */

char buffer[SIZE];

FILE *fp = fopen(...);          /* fopen returns a FILE*, not a file descriptor */
setbuffer(fp, buffer, SIZE);    /* BSD/glibc extension; setvbuf is the portable equivalent */

Alternatively, do you need to read character by character?
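
If not, a bulk read is usually the fastest route. A sketch, assuming a hypothetical helper name and chunked fread into a std::vector:

#include <cstdio>
#include <vector>

// read_all is a hypothetical helper, not part of any library
std::vector<char> read_all(const char *path) {
    std::vector<char> data;
    if (FILE *fp = std::fopen(path, "rb")) {
        char chunk[65536];
        std::size_t n;
        while ((n = std::fread(chunk, 1, sizeof chunk, fp)) > 0)
            data.insert(data.end(), chunk, chunk + n);   // append each chunk
        std::fclose(fp);
    }
    return data;   // the whole file, addressable char by char as data[i]
}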

Serge Ballesta

If the file is a text file, it is probably made of reasonably sized lines, so you may try to read it line by line, e.g. with std::getline (or, in C, getline(3)).
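
For instance, a line-by-line loop might look like this (a sketch; the file name is a placeholder):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("bigfile.txt");       // hypothetical file name
    std::string line;
    while (std::getline(in, line)) {       // one buffered read per line
        // process `line`; individual characters are line[i]
    }
    return 0;
}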

If you are on a POSIX system, e.g. Linux, you could use low-level syscalls(2) like read(2) or mmap(2). Be sure to use large enough buffers, e.g. 16 Kbytes or 64 Kbytes.
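
A sketch of the mmap(2) route (POSIX only; error handling kept minimal and the file name a placeholder), which gives char-by-char access without any explicit read loop:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("bigfile.bin", O_RDONLY);            // hypothetical file name
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 1; }

    // Map the whole file read-only; the kernel pages it in on demand.
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }

    const char *data = static_cast<const char *>(p);
    // data[0] .. data[st.st_size - 1] are now addressable char by char.

    munmap(p, st.st_size);
    close(fd);
    return 0;
}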

BTW, if on Linux, try `time wc yourbigfile`; it should give you an idea of a lower bound on the time actually needed to read your file. Remember that there is a file system cache: see http://linuxatemyram.com/ for more.

On my Linux desktop system, `wc` on a 6 Mbyte, 100 Kline file takes about 0.1 seconds of real time.

Perhaps read Advanced Linux Programming, at least if you run your program on POSIX systems.

BTW, your question is operating system and perhaps file system specific.

Basile Starynkevitch

The entire problem with my code was that I was using fgetpos and fsetpos every time I wanted to "return" a char; switching to ungetc significantly increased the speed!
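
For reference, a minimal sketch of the ungetc pattern (file name is a placeholder): the character is pushed back into the stream's buffer, so no repositioning of the file is needed:

#include <cstdio>

int main() {
    FILE *fp = std::fopen("bigfile.bin", "rb");   // hypothetical file name
    if (!fp) return 1;

    int c = std::fgetc(fp);
    if (c != EOF)
        std::ungetc(c, fp);        // push the char back into the stream buffer

    int again = std::fgetc(fp);    // returns the same char; no fsetpos needed
    (void)again;                   // silence unused-variable warnings

    std::fclose(fp);
    return 0;
}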

Raúl Sanpedro