21

If I have a huge file on disk (e.g. 1 TB, or any size that does not fit into RAM), delimited by spaces, and my RAM is only 8 GB, can I read that file with ifstream? If not, how can I read a block of the file (e.g. 4 GB)?

ZigZagZebra
  • 1,349
  • 3
  • 14
  • 25
  • How is it delimited? Is it line-oriented text? Can you read a line at a time? – nicomp Jan 12 '16 at 19:11
  • 2
    @nicomp I doubt that one could have a text file of size 1 TB. – Oleg Andriyanov Jan 12 '16 at 19:12
  • There is no way. You cannot put 1 TB into any RAM smaller than that. If you want to extract data from that file, it might be possible. –  Jan 12 '16 at 19:17
  • 3
    Have you tried reading it? If not, why not? If you did, what didn't work? – Cheers and hth. - Alf Jan 12 '16 at 19:19
  • @DieterLücking he doesn't seem to be asking about reading the entire file at once into memory. Is your comment suggesting that with `ifstream`, you must do exactly that? – mah Jan 12 '16 at 19:19
  • @mah Surely not; any ifstream likely has a tiny internal buffer compared to those requirements –  Jan 12 '16 at 19:22
  • How do you need to read and process it? That really helps determine the best way to solve your problem, because processing that much data in a stream is a lot different from randomly jumping around throughout the file. – Andrew Henle Jan 12 '16 at 19:53
  • @ZigZagZebra, I believe that you are struggling with the speed of reading the file, aren't you? Otherwise I don't see the problem of reading a whatever-size long file, you read a chunk and then process it again and again till you reach the end. What OS is your software supposed to be run on? – neshkeev Jan 12 '16 at 21:15
  • 1
    @OlegAndriyanov Want me to send you one? – nicomp Jan 12 '16 at 23:41
  • 1
    @nicomp Yeah, why not. `/dev/null` seems to be a good name to store it. – Oleg Andriyanov Jan 12 '16 at 23:46
  • @OlegAndriyanov That is the null device. You shouldn't store files there. Anything you write there is deleted. – nicomp Jan 13 '16 at 01:30
  • @nicomp You didn't get my joke. – Oleg Andriyanov Jan 13 '16 at 06:25

4 Answers

32

There are a couple of things that you can do.

First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is copy the whole file into memory at once. The best thing is to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream.read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of the file into it, rinse and repeat:

#include <fstream>
#include <memory>

std::ifstream bigFile("mybigfile.dat", std::ios::binary);
constexpr size_t bufferSize = 1024 * 1024;
std::unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
    bigFile.read(buffer.get(), bufferSize);
    // process data in buffer; bigFile.gcount() reports how many bytes were actually read
}
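
The last read may fill the buffer only partially, so a variant that also checks gcount could look like this (a sketch with a hypothetical helper readInChunks; the path is a placeholder):

#include <fstream>
#include <memory>

void readInChunks(const char* path)
{
    constexpr std::size_t bufferSize = 1024 * 1024;
    std::unique_ptr<char[]> buffer(new char[bufferSize]);
    std::ifstream bigFile(path, std::ios::binary);
    // read() sets failbit on a short read, so also check gcount() for the final partial chunk
    while (bigFile.read(buffer.get(), bufferSize) || bigFile.gcount() > 0)
    {
        std::streamsize bytesRead = bigFile.gcount();  // may be less than bufferSize at the end
        // process bytesRead bytes starting at buffer.get()
    }
}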

Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.

However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.

Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated by another program, your program is likely to crash. If you modify the data, it's possible that you will run out of memory if you can't save back to the disk. Also, your operating system's algorithm for paging memory in and out may not work to your advantage. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.

On Linux/OS X, you would use mmap for it. On Windows, you would open a file and then use CreateFileMapping then MapViewOfFile.
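
For instance, a minimal mmap sketch for Linux/OS X, assuming the file is called "mybigfile.dat" and skipping most error handling:

#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

void mapWholeFilePosix()
{
    int fd = open("mybigfile.dat", O_RDONLY);
    struct stat st;
    fstat(fd, &st);  // st.st_size is the file size in bytes

    // Map the whole file read-only; the OS pages it in on demand.
    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data != MAP_FAILED)
    {
        const char* bytes = static_cast<const char*>(data);
        // ... scan bytes[0 .. st.st_size - 1] as if the whole file were in memory ...
        munmap(data, st.st_size);
    }
    close(fd);
}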

zneak
  • 134,922
  • 42
  • 253
  • 328
  • 1
    Side note: The usual mistake; not testing a stream operation: `while (bigFile) { bigFile.read(...); ... }` –  Jan 12 '16 at 20:15
  • If `bigFile.read()` reads less than the amount that is requested, then the nread amount is in `bigFile.gcount()`. Can we use this nread value to index into the buffer and continue the read-loop? – daparic May 27 '20 at 05:34
8

I am sure you don't have to keep the whole file in memory. Typically one wants to read and process a file in chunks. If you want to use ifstream, you can do something like this:

std::ifstream is("/path/to/file", std::ios::binary);
char buf[4096];
do {
    is.read(buf, sizeof(buf));
    process_chunk(buf, is.gcount());  // process_chunk is your own processing function
} while (is);
Oleg Andriyanov
  • 5,069
  • 1
  • 22
  • 36
  • Is it possible that `is.read()` can sometimes only read less than `4096` bytes? – daparic May 27 '20 at 05:32
  • @typelogic yes, if end of file has been reached, or some kind of read error occurred, `read` can read fewer bytes than requested. See https://en.cppreference.com/w/cpp/io/basic_istream/read – Oleg Andriyanov May 27 '20 at 14:20
  • Sorry, my main question was whether we can continue the read loop in the middle of an interruption where nread is less than requested. Is the `ifstream` object still sane? I'm asking because an interrupted `read` is very common in C inside a loop and the `read` continues on. – daparic May 27 '20 at 16:25
  • @typelogic Short read does not indicate an error directly. If by "sane" you mean "can I continue reading from the stream", then I believe you should test the return value of [`fail()`](https://en.cppreference.com/w/cpp/io/basic_ios/fail). Also, if you are concerned with precise error handling and low level stuff like interruption by a signal, you might want to stick with C and raw POSIX `read()` — the API and the docs are much more clean and informative. – Oleg Andriyanov May 27 '20 at 17:01
  • 1
    what's the benefit of using `do...while(is)` rather than just `while(is)`? – starriet Aug 26 '22 at 15:58
  • @starriet you're right, I think there's no benefit – Oleg Andriyanov Aug 28 '22 at 16:24
3

A more advanced approach, instead of reading the whole file or chunks of it into memory, is to map it to memory using platform-specific APIs:

Under Windows: CreateFileMapping(), MapViewOfFile()

Under Linux: open(2) / creat(2), shm_open, mmap

You will need to compile a 64-bit app to make it work.

For more details see here: CreateFileMapping, MapViewOfFile, how to avoid holding up the system memory
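
A rough sketch of the Windows route, assuming the file is called "mybigfile.dat" and with error handling omitted:

#include <windows.h>

void mapWholeFileWindows()
{
    HANDLE file = CreateFileA("mybigfile.dat", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);

    // Map the whole file; again, the OS pages it in on demand.
    const char* data = static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (data != nullptr)
    {
        // ... use data as if the file contents were in memory ...
        UnmapViewOfFile(data);
    }
    CloseHandle(mapping);
    CloseHandle(file);
}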

marcinj
  • 48,511
  • 9
  • 79
  • 100
1

You can use fread

char buffer[size];
size_t nread = fread(buffer, sizeof(char), size, fp);  // nread = bytes actually read

Or, if you want to use C++ fstreams, you can use read as buratino said.

Also keep in mind that you can open a file regardless of its size; the idea is to open it and read it in chunks that fit in your RAM.
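
For example, a minimal chunked fread loop might look like this (the file name "myfile.txt" and the 4096-byte chunk size are just placeholders):

#include <cstdio>

void processWholeFile()
{
    FILE* fp = std::fopen("myfile.txt", "rb");
    if (!fp)
        return;

    char buffer[4096];
    std::size_t nread;
    // fread returns the number of bytes actually read; 0 means end of file (or an error)
    while ((nread = std::fread(buffer, 1, sizeof(buffer), fp)) > 0)
    {
        // process nread bytes from buffer
    }
    std::fclose(fp);
}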

marian0
  • 664
  • 6
  • 15
  • 1
    He asked about `ifstream`. I believe a more relevant function call would be to [read](http://www.cplusplus.com/reference/istream/istream/read/) – buratino Jan 12 '16 at 19:19
  • I read fread doc. So if I use `FILE * pFile; pFile = fopen ( "myfile.txt" , "rb" );` and myfile.txt cannot be fit in the RAM, can I still open it in this way? – ZigZagZebra Jan 12 '16 at 19:22
  • 2
    fopen does not load the file in ram, so yes you should be able to do it without problems. – marian0 Jan 12 '16 at 19:27
  • 1
    Doesn't C file IO have a 2 GB limit? – 0xB00B Mar 23 '22 at 16:05