
In my career I have never had to deal with files this large as the central aspect of my application, and I am a bit stumped.

My application will record a lot of things to a file, and it might grow to several gigabytes. The application will have to run on Windows (7 onward) and all modern Linux (kernel 4 onward), on both 32-bit and 64-bit machines.

I tried using fread and friends since they are portable, but fseek and ftell take the offset as a long. I considered splitting the file into multiple 2 GB files so I could keep working with signed long, but that seems like too much processing overhead.
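
To illustrate what I mean (the file name is just a placeholder): on targets where long is 32 bits, the largest offset fseek can accept is LONG_MAX, which is just under 2 GiB.

#include <limits.h>
#include <stdio.h>

int main()
{
    FILE *f = fopen( "big.dat", "rb" );  // placeholder file name
    if( !f )
        return 1;

    // Where long is 32 bits (Windows, 32-bit Linux), LONG_MAX is 2147483647,
    // so fseek cannot express an offset beyond roughly 2 GiB.
    const long maxOffset = LONG_MAX;
    printf( "largest offset fseek can take here: %ld bytes\n", maxOffset );

    fseek( f, maxOffset, SEEK_SET );  // anything past this needs a 64-bit API
    fclose( f );
    return 0;
}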

Is there a set of portable library functions that will help with the task, or do I have to write a thin wrapper around the platform API?


EDIT: To answer some questions from the comments:

  1. This is a binary data file.
  2. I don't need to load it all in memory but will have to seek to offsets.
  3. About using fstream, it seems it wouldn't work based on this, so I didn't factor it in.
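
For reference, this is roughly what seeking with std::fstream would look like (the file name and offset are placeholders); whether it actually reaches past 2 GB on 32-bit builds depends on the library's large-file support, which is exactly what I'm unsure about.

#include <cstdint>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream file( "big.dat", std::ios::binary );  // placeholder file name
    if( !file )
        return 1;

    // std::streamoff is a 64-bit type on mainstream implementations,
    // so the offset value itself is not the limiting factor.
    const std::streamoff offset = (std::streamoff)3 * 1024 * 1024 * 1024;  // 3 GiB
    file.seekg( offset, std::ios::beg );

    char byte;
    if( file.read( &byte, 1 ) )
        std::cout << "byte at 3 GiB: " << (int)(unsigned char)byte << '\n';
    return 0;
}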
vin
  • Have you experienced a problem with `std::fstream` and co with files of this size? – Lightness Races in Orbit Jan 16 '19 at 12:59
  • Do you have to read the whole thing into memory at once? If not I think you're totally correct to want to split the files up. – Tom Jan 16 '19 at 13:00
  • https://stackoverflow.com/a/30610742/560648 https://stackoverflow.com/q/32080076/560648 https://stackoverflow.com/q/34751873/560648 https://stackoverflow.com/q/7135598/560648 et al – Lightness Races in Orbit Jan 16 '19 at 13:00
  • @Tom I disagree - splitting the files up on the filesystem should ideally not be necessary. Of course you wouldn't read the whole file into memory at once but you shouldn't do that anyway – Lightness Races in Orbit Jan 16 '19 at 13:01
  • @LightnessRacesinOrbit I hear what you're saying - there are of course many ways to process files in chunks, e.g. using buffers, paging, etc. – Tom Jan 16 '19 at 13:02
  • What data are you writing to the file? Is it a text or a "binary" file? Do you need to read *all* of the file or just part of it? Can you read and process the data from the file in smaller chunks? And what are you doing with the file once written, what are the files for? – Some programmer dude Jan 16 '19 at 13:03
  • Agreed, I've never had an issue with streams on binary files on clusters; I don't think anyone has had such a problem. – Matthieu Brucher Jan 16 '19 at 13:09
  • Even fread can handle it on 64-bit systems (size_t is 64 bits in that case). – Matthieu Brucher Jan 16 '19 at 13:11
  • What about mapping the entire file to memory? Only the accessed regions of the mapping will actually cause page faults, so the system only commits the physical memory your application really needs. What the STL provides is really primitive compared to what the system offers (see the sketch after these comments). – Oliv Jan 16 '19 at 13:17
  • @Oliv When I said 'use platform API' I meant even this memory mapping you suggest. I want to get there only as a last resort. – vin Jan 16 '19 at 13:20
  • @vin It is simpler to use than you imagine, maybe simpler than the standard library, and way more efficient if used correctly. Actually, I do not see any reason to use the standard library or the C library for files stored on block devices. – Oliv Jan 16 '19 at 13:25
  • Programming is also important, but you need to be careful not to use a FAT32-formatted disk (USB stick, etc.). The maximum volume size can be up to 2 TB, but the maximum size of a single file is 4 GB. – kunif Jan 16 '19 at 16:39
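
A minimal sketch of the memory-mapping approach Oliv suggests, assuming Linux/POSIX mmap, a 64-bit process, and a placeholder file name (on Windows the equivalent would be CreateFileMapping/MapViewOfFile, and a 32-bit process would have to map smaller windows of the file rather than the whole thing):

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const int fd = open( "big.dat", O_RDONLY );  // placeholder file name
    if( fd < 0 )
        return 1;

    struct stat st;
    if( fstat( fd, &st ) != 0 )
        return 1;

    // Map the whole file read-only; pages are faulted in only when touched.
    void *p = mmap( nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    if( p == MAP_FAILED )
        return 1;
    const uint8_t *data = static_cast<const uint8_t *>( p );

    // "Seeking" is now just pointer arithmetic.
    const off_t offset = st.st_size / 2;
    printf( "byte at offset %lld: %d\n", (long long)offset, (int)data[ offset ] );

    munmap( p, (size_t)st.st_size );
    close( fd );
    return 0;
}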

1 Answer


I don't think you can do this portably, but this workaround might help (untested).

// The feature-test macro has to be defined before any system header is included.
#define _LARGEFILE64_SOURCE
#include <stdint.h>
#include <stdio.h>
#ifdef _MSC_VER
// MSVC: _fseeki64 takes a 64-bit offset and returns 0 on success.
inline int seek64( FILE *stream, int64_t offset, int origin )
{
    return _fseeki64( stream, offset, origin );
}
#else
#include <sys/types.h>
#include <unistd.h>
inline int seek64( FILE *stream, int64_t offset, int origin )
{
    // Flush stdio buffers first, otherwise the buffered stream and the
    // underlying descriptor disagree about the current file position.
    if( fflush( stream ) != 0 )
        return -1;
    const int fd = fileno( stream );
    // lseek64 returns the new offset or -1; normalise to fseek-style 0 / -1.
    return lseek64( fd, offset, origin ) == (off64_t)-1 ? -1 : 0;
}
#endif
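Usage could look something like this (the file name and the 3 GiB offset are placeholders; the wrapper above is assumed to be in scope):

int main()
{
    FILE *f = fopen( "big.dat", "rb" );  // placeholder file name
    if( !f )
        return 1;

    // Seek 3 GiB into the file -- far beyond what a 32-bit long can express.
    const int64_t offset = (int64_t)3 * 1024 * 1024 * 1024;
    if( seek64( f, offset, SEEK_SET ) != 0 )
    {
        fclose( f );
        return 1;
    }

    unsigned char byte;
    if( fread( &byte, 1, 1, f ) == 1 )
        printf( "byte at 3 GiB: %d\n", (int)byte );

    fclose( f );
    return 0;
}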
Soonts