
Suppose I have a file that contains x records, one 'block' holds m records, and so the file has n = x/m blocks. If I know the size of one record, say b bytes (so one block is b*m bytes), I can read a complete block at once with the read() system call (is there any other method?). How do I then read each record out of that block and put each record into a vector as a separate element?
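
Something like the rough sketch below is what I have in mind (the file name and the values of b and m are just placeholders):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <vector>

int main() {
    const size_t b = 100;                 // size of one record in bytes (placeholder)
    const size_t m = 1000;                // records per block (placeholder)
    std::vector<char> block(b * m);       // buffer that holds one whole block
    std::vector<std::string> records;

    int fd = open("data.bin", O_RDONLY);  // placeholder file name
    if (fd < 0) return 1;

    // A single read() fetches the whole block in one I/O...
    ssize_t got = read(fd, block.data(), block.size());

    // ...and the records are then split off from the in-memory block.
    for (ssize_t off = 0; off + static_cast<ssize_t>(b) <= got; off += b)
        records.emplace_back(block.data() + off, b);

    close(fd);
}
```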

The reason I want to do this in the first place is to reduce the number of disk I/O operations, since disk I/O is much more expensive from what I have learned. Or will it take the same amount of time as reading the file record by record and putting each record directly into the vector, rather than reading block by block? Reading block by block, I would need only n disk I/Os, whereas reading record by record would need x.

Thanks.

Paagalpan

2 Answers


You should consider using mmap() instead of reading your files using read().

What's nice about mmap is that the file contents are simply mapped into your process's address space, as if you already had a pointer to them. By inspecting that memory and treating it as an array, or by copying data out of it with memcpy(), you implicitly perform read operations, but only as necessary: the operating system's virtual memory subsystem is smart enough to do this very efficiently.
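
For illustration, here's a minimal sketch of that idea (the file name and record size are assumptions, and error checking is omitted):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>

int main() {
    int fd = open("records.bin", O_RDONLY);   // placeholder file name
    struct stat st;
    fstat(fd, &st);

    // Map the whole file; pages are read from disk lazily, only when touched.
    const char *data = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    const size_t record_size = 100;           // placeholder record size in bytes
    // data + i * record_size points at record i, as if the file were
    // already an in-memory array of records.
    char one_record[100];
    memcpy(one_record, data + 0 * record_size, record_size);  // copy out record 0

    munmap(const_cast<char *>(data), st.st_size);
    close(fd);
}
```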

The only real reason to avoid mmap may be if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that); in that case the OS may have trouble allocating enough address space for the mapping. On a 64-bit OS, using mmap should never be a problem.

Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known upfront. Other than that, it is generally better and faster to use it than read().

In fact, most modern operating systems rely on mmap extensively. For example, on Linux, to run a binary, the executable is simply mmap-ed and executed from memory as if it had been copied there by read, without the file actually being read up front.

mvp

Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).

It's still possible that reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same thing by simply adjusting the size of the buffer the stream uses (which is probably a lot easier).
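
For example, something along these lines installs a larger buffer (the size and file name here are arbitrary; on common implementations pubsetbuf has to be called before the file is opened for it to take effect):

```cpp
#include <fstream>
#include <string>

int main() {
    static char buffer[1 << 20];          // 1 MB buffer instead of the default few KB
    std::ifstream file;
    file.rdbuf()->pubsetbuf(buffer, sizeof(buffer));   // install the bigger buffer
    file.open("data.txt");                // placeholder file name

    std::string line;
    while (std::getline(file, line)) {
        // each getline is served from the buffer; reads from the OS
        // happen in buffer-sized chunks
    }
}
```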

Jerry Coffin
  • Hi, thanks for the response. So just to be clear, you mean to say that the first time I call "getline(file_pointer,str)", C++ has implicitly already brought the block that contains this line from hdd to memory? And hence, the second time I read a line, no disk I/O would be done but rather the line would be read from the main memory (given that both the lines lie in the same block)? – Paagalpan Feb 28 '13 at 08:36
  • Also, the block size can be as big as 2-3 MB. Are buffer sizes that large? How do we increase the buffer size? – Paagalpan Feb 28 '13 at 11:44
  • @NikharAgrawal: Yes, when you call getline it'll probably read something like 8K of data from the disk, then send you the (say) 60 bytes that make up the line. And yes, it won't issue another read from the disk until that 8K is used up. But also yes, it'll probably be on the order of 4-16KB, not 2-3 MB. – Jerry Coffin Feb 28 '13 at 14:06
  • Thanks. :) Any idea how we actually increase the buffer size in C++? – Paagalpan Feb 28 '13 at 20:03
  • 1
    @NikharAgrawal: Ooop, sorry. `yourstream.rdbuf()->pubsetbuf(buffer, sizeof(buffer));` – Jerry Coffin Feb 28 '13 at 20:37
  • Thanks again. :) Just another query. I assume 'buffer' in the above statement is a character array. Supposing that one of my record lies in the 995-1005 bytes region and my buffer size is 1000. So basically, the last record doesn't completely come in this buffer. If I wish to fetch this record, will it be fetched correctly? Also, are there any disadvantages associated with large buffer size? – Paagalpan Mar 01 '13 at 01:30
  • Hmm...getting some interesting results. Using setvbuf in C code, too small a buffer size slowed down the program quite a bit; moderate and very large buffer sizes produced approximately the same time. In C++, using rdbuf and pubsetbuf, changing the buffer size doesn't seem to make any difference at all. – Paagalpan Mar 01 '13 at 02:47
  • @NikharAgrawal: Yes, there can be a disadvantage of a huge buffer. If you fill a huge buffer, then use only a small fraction of it, you'll have spent a lot of extra time reading data you never use. As far as larger buffer making no difference: little hard to be sure what's going on without seeing some source code. – Jerry Coffin Mar 01 '13 at 03:45
  • Oh...I see. Thanks again. Here's the question that lists my results and the source code http://stackoverflow.com/questions/15150451/cant-make-sense-of-the-varying-results-of-experiments-with-buffer-sizes-in-c-an – Paagalpan Mar 01 '13 at 04:25