
I have a huge file, which I'm reading with fopen & fgetc in a loop.

It takes around 6 seconds to read the entire file with the "rb" flag in fopen; there are around 25k lines in the file.

I was wondering: what are faster ways than fgetc? Is it better to first load everything into a char* array? Is strcpy better?

  • Note that it's best if it works the way fgetc does, or at least lets me get at the data char by char in the array.

  • What are better ways than fgetc?

Raúl Sanpedro
  • [fread](http://pubs.opengroup.org/onlinepubs/009695399/functions/fread.html)? – Anton Savin Sep 24 '14 at 09:45
  • First of all, C or C++? – JBL Sep 24 '14 at 09:45
  • Title says C++, so C++ – Raúl Sanpedro Sep 24 '14 at 09:47
  • Generally, reading bigger chunks is better. Also take a look at `fgetc_unlocked`, if your implementation provides it. – Deduplicator Sep 24 '14 at 09:51
  • [`std::getc`](http://en.cppreference.com/w/cpp/io/c/fgetc) can be implemented as a macro (which I find funny considering it's namespace-qualified). This is intentional (the macro part); it has the benefit of potentially reducing the invoke overhead unless a buffer refill is required. Worth a try to bench it (in **release** build, of course). And I concur with Deduplicator: front-lock the file and don't forget to unlock it when finished. – WhozCraig Sep 24 '14 at 09:51
  • @WhozCraig: You are really absolutely sure it can be implemented as a macro (and not only as an inline function) in C++ too, and not only in C? – Deduplicator Sep 24 '14 at 09:53
  • Reading each character in a loop is going to be slower than reading the whole file into memory, or even a chunk at a time. So, are you reading the whole file, or just part of it, and do you need to read one character at a time? – TheDarkKnight Sep 24 '14 at 09:57
  • fgetc_unlocked & std::getc gave the same benchmark speed, 6 seconds. – Raúl Sanpedro Sep 24 '14 at 09:57
  • @RaúlSanpedro on a release build? (gotta ask). – WhozCraig Sep 24 '14 at 09:58
  • Perhaps this is what you really want: http://stackoverflow.com/questions/410943/reading-a-text-file-into-an-array-in-c – TheDarkKnight Sep 24 '14 at 10:00
  • @WhozCraig Yes, release build. – Raúl Sanpedro Sep 24 '14 at 10:02
  • @Merlin069 `char *bytes = malloc(pos);` fails with "a value of type void* cannot be used to initialize an entity of type char*" – Raúl Sanpedro Sep 24 '14 at 10:06
  • Well that's odd, because when I flockfile + getc() through EOF (saving in a contiguous memory buffer) + funlockfile on a 32MB disk file read in binary mode, I'm at 683ms (yes, it's an SSD). Removing the flockfile/funlockfile brackets pushes the same code up to 2360ms. And using `fgetc` only ratchets it up another 150ms or so, so I'm with Deduplicator on this (see the sketch after these comments). – WhozCraig Sep 24 '14 at 10:08
  • malloc returns a void*, so cast it to a char* with `char* bytes = (char*)malloc(pos);` – TheDarkKnight Sep 24 '14 at 10:10
  • @RaúlSanpedro if you're man-handling the entire file, at least use a `std::vector`. `malloc` has no place in a modern C++ program. – WhozCraig Sep 24 '14 at 10:11
  • @WhozCraig **flockfile** is unrecognized in my code; what header file is it in? – Raúl Sanpedro Sep 24 '14 at 10:13
  • [**See here**](http://man7.org/linux/man-pages/man3/flockfile.3.html) – WhozCraig Sep 24 '14 at 10:14
  • @WhozCraig having the buffer in a std::vector? Will it be faster than fgetc? – Raúl Sanpedro Sep 24 '14 at 10:14
  • The purpose of the vector-owned buffer is to [avoid pointers owning resources](https://dl.dropboxusercontent.com/u/6101039/Modern%20C%2B%2B.pdf). That's in the event you decide to bulk-load it. – WhozCraig Sep 24 '14 at 10:16
  • On POSIX-compliant systems, mmap() will achieve the "access character by character" requirement. Let the OS deal with memory allocation and I/O buffering; those are things it is good at. – Andrew Sep 24 '14 at 12:15
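
For completeness, here is a minimal sketch of the locked-read loop Deduplicator and WhozCraig describe in the comments. flockfile/funlockfile and getc_unlocked are POSIX extensions rather than standard C++, and the file name is a placeholder:

#include <stdio.h>   // flockfile, getc_unlocked, funlockfile are POSIX
#include <vector>

int main() {
    FILE *fp = fopen("bigfile.bin", "rb");       // hypothetical file name
    if (!fp) return 1;

    std::vector<char> bytes;
    flockfile(fp);                               // take the stream lock once up front
    for (int c; (c = getc_unlocked(fp)) != EOF; )
        bytes.push_back(static_cast<char>(c));   // per-call locking is skipped inside the bracket
    funlockfile(fp);                             // release the lock when finished

    fclose(fp);
    return 0;
}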

3 Answers


Well, fgetc is already pretty well optimized, because it uses the underlying buffering set up by fopen. You simply call a function (but not a system call) for each character. You could try to increase the buffer size (since you say you are reading huge files) with setbuffer:

#include <stdio.h>

#define SIZE 65536   /* or use an even greater size if appropriate */

char buffer[SIZE];

FILE *fp = fopen(...);          /* fopen returns a FILE*, not a file descriptor */
setbuffer(fp, buffer, SIZE);    /* BSD/glibc extension; setvbuf is the portable equivalent */

Alternatively, do you need to read character by character?
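
If not, a bulk read is usually the fastest route. A sketch, assuming a hypothetical helper name and chunked fread into a std::vector:

#include <cstdio>
#include <vector>

// read_all is a hypothetical helper, not part of any library
std::vector<char> read_all(const char *path) {
    std::vector<char> data;
    if (FILE *fp = std::fopen(path, "rb")) {
        char chunk[65536];
        std::size_t n;
        while ((n = std::fread(chunk, 1, sizeof chunk, fp)) > 0)
            data.insert(data.end(), chunk, chunk + n);   // append each chunk
        std::fclose(fp);
    }
    return data;   // the whole file, addressable char by char as data[i]
}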

Serge Ballesta

If the file is a text file, it is probably made of reasonably sized lines, so you may try to read it line by line, e.g. with std::getline (or, in C, getline(3)).
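
For instance, a line-by-line loop might look like this (a sketch; the file name is a placeholder):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("bigfile.txt");       // hypothetical file name
    std::string line;
    while (std::getline(in, line)) {       // one buffered read per line
        // process `line`; individual characters are line[i]
    }
    return 0;
}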

If you are on a POSIX system, e.g. Linux, you could use low-level syscalls(2) like read(2) or mmap(2). Be sure to use large enough buffers, e.g. 16 Kbytes or 64 Kbytes.
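
A sketch of the mmap(2) route (POSIX only; error handling kept minimal and the file name a placeholder), which gives char-by-char access without any explicit read loop:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("bigfile.bin", O_RDONLY);            // hypothetical file name
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 1; }

    // Map the whole file read-only; the kernel pages it in on demand.
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }

    const char *data = static_cast<const char *>(p);
    // data[0] .. data[st.st_size - 1] are now addressable char by char.

    munmap(p, st.st_size);
    close(fd);
    return 0;
}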

BTW, if on Linux, try `time wc yourbigfile`; it should give you an idea of a lower bound on the time actually needed to read your file. Remember that there is a file system cache: see http://linuxatemyram.com/ for more.

On my Linux desktop system, `wc` on a 6 Mbyte, 100 Kline file takes about 0.1 seconds of real time.

Perhaps read Advanced Linux Programming, at least if you run your program on POSIX systems.

BTW, your question is operating system and perhaps file system specific.

Basile Starynkevitch

The entire problem with my code was that I was using fgetpos and fsetpos every time I wanted to "return" a char; switching to ungetc significantly increased the speed!
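
For reference, a minimal sketch of the ungetc pattern (file name is a placeholder): the character is pushed back into the stream's buffer, so no repositioning of the file is needed:

#include <cstdio>

int main() {
    FILE *fp = std::fopen("bigfile.bin", "rb");   // hypothetical file name
    if (!fp) return 1;

    int c = std::fgetc(fp);
    if (c != EOF)
        std::ungetc(c, fp);        // push the char back into the stream buffer

    int again = std::fgetc(fp);    // returns the same char; no fsetpos needed
    (void)again;                   // silence unused-variable warnings

    std::fclose(fp);
    return 0;
}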

Raúl Sanpedro