
My task is to reorganize a big (~1 GB) binary file. I have to get values of different types and write them back into one single big file, transposed. The original file looks like this (V stands for Value):

V1.1,V2.1,V3.1...VX.1,V1.2,V2.2,V3.2,...VX.2... ...VX.Y

The output file should look like this: V1.1,V1.2...V1.Y,V2.1,V2.2...VX.Y.

What I am doing now is to open a bunch of temporary files and write all V1 into the first, all V2 into the second... Once I am through the original file, I concatenate all the temporary files.

My limitations are:

- Memory (that's most important, 0 would be best)
- Speed (my task is to do this as fast as possible)

My problem now is:

- When using file streams or FILE* I am limited to 2048 files per process, and there might be more than 2000 values in that original file.
- Using CreateFile is very, very slow.

How I am reading the data: I know how many values are in one block (e.g. V1.1 - VX.1 --> X = 1000). The input file is an ifstream, where I read the data into a vector of bytes; then I write every value into a FILE* via fwrite(). Then I read the next block V1.2 - VX.2, and so on...
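For illustration, a simplified sketch of this approach (assuming fixed-size values; X, the value size and the file names are placeholders, and error handling is omitted):

```cpp
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Sketch of the current scheme: one temporary file per value position.
// Assumes every value is a fixed number of bytes (valueSize); X is the
// number of values per block.
void splitIntoTempFiles(const std::string& inputPath, std::size_t X, std::size_t valueSize)
{
    std::ifstream in(inputPath, std::ios::binary);

    std::vector<FILE*> temps(X);
    for (std::size_t i = 0; i < X; ++i)
        temps[i] = std::fopen(("temp_" + std::to_string(i) + ".bin").c_str(), "wb");

    std::vector<char> block(X * valueSize);
    while (in.read(block.data(), static_cast<std::streamsize>(block.size())))
        for (std::size_t i = 0; i < X; ++i)          // value i of this block goes to temp file i
            std::fwrite(block.data() + i * valueSize, valueSize, 1, temps[i]);

    for (FILE* f : temps)
        std::fclose(f);
    // ...afterwards the temporary files are concatenated into the output.
}
```

With more than ~2000 values this is exactly where the per-process limit on open FILE* handles bites.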

My question now is:

Is there a way to handle such a situation correctly? I know I will have to compromise somehow. How can I speed this up without increasing the memory footprint too much?

Thanks in advance, Nicolas

Edit: OS is Windows XP Embedded, .NET 4.0

Edit: Source file size is ~1 GB

Edit: My first approach was to create a skeleton file and fill it with data using fseek, but that was even slower than my current approach.
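For illustration, a simplified sketch of that fseek-based variant (same fixed-size-value assumption; all names are placeholders). The many small, scattered writes are presumably what made it slow:

```cpp
#include <cstdio>
#include <vector>

// Sketch of the "skeleton file" variant: the output is preallocated and each
// value is scattered to its transposed position with fseek/fwrite.
// X = values per block, Y = number of blocks, valueSize = bytes per value.
void scatterWithFseek(const char* inPath, const char* outPath,
                      std::size_t X, std::size_t Y, std::size_t valueSize)
{
    FILE* in  = std::fopen(inPath,  "rb");
    FILE* out = std::fopen(outPath, "wb");

    // Create the "skeleton": extend the output to its final size up front.
    std::fseek(out, static_cast<long>(X * Y * valueSize - 1), SEEK_SET);
    std::fputc(0, out);

    std::vector<char> block(X * valueSize);
    for (std::size_t y = 0; y < Y; ++y)
    {
        std::fread(block.data(), valueSize, X, in);           // block y: V1.y .. VX.y
        for (std::size_t x = 0; x < X; ++x)
        {
            long offset = static_cast<long>((x * Y + y) * valueSize);  // transposed position
            std::fseek(out, offset, SEEK_SET);
            std::fwrite(block.data() + x * valueSize, valueSize, 1, out);
        }
    }
    std::fclose(in);
    std::fclose(out);
}
```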

Edit: The program will run on a hard disk RAID-1 setup.

  • For this kind of problem, it is reasonable to take the approach of leveraging any operating system-specific facilities that can be taken advantage of. Unfortunately, you may be shocked to learn that there's more than one operating system that's used on all computers in the world. So, without stating which platform is being used here, no authoritative answer is possible. – Sam Varshavchik Jan 04 '16 at 13:32
  • OS is Windows, in detail XP Embedded, .NET 4.0 – Nicolas Grinschgl Jan 04 '16 at 13:36
  • `fseek` may help, so you can work with only 2 files (input and output). – Jarod42 Jan 04 '16 at 13:41
  • You may create something like a pool of open files. Experiment with various policies w.r.t. which files to keep in the pool (e.g. most recently used, most frequently used, etc.) – kfx Jan 04 '16 at 13:42
  • Also important - what is the hardware? HDD or an SSD? Anything special about it (e.g. size of disk cache, etc.)? – kfx Jan 04 '16 at 13:44
  • @Sam Varshavchik, sarcasm does not add any value to your comment. – BlueTrin Jan 04 '16 at 13:52
  • 3
    I'd probably consider looking into `sqlite` with the current, somewhat sparse spec of requirements. More here: http://stackoverflow.com/questions/93654/is-there-a-net-c-wrapper-for-sqlite Why not let something else organize the data for you? Add all the data then query the db in such a manner that you get the data back sorted in a way that suits you. It shouldn't take very long to at least try. – enhzflep Jan 04 '16 at 13:57
  • From your edit, it seems that you tried `fseek` for the output; I would try it for the input. – Jarod42 Jan 04 '16 at 14:10
  • It helps to know what this operation is. It's not a "total reorganization", it's called "transposition". – MSalters Jan 04 '16 at 15:05

3 Answers


By modern standards, 1 GB is small. You can easily afford to hold the output in main memory, as you read the input sequentially.
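A minimal sketch of this single-pass variant, assuming fixed-size values and that the ~1 GB output buffer can actually be allocated (names and parameters are illustrative):

```cpp
#include <algorithm>
#include <fstream>
#include <vector>

// Single-pass sketch: the whole transposed result is built in memory and
// written once. X = values per block, Y = number of blocks.
void transposeInMemory(const char* inPath, const char* outPath,
                       std::size_t X, std::size_t Y, std::size_t valueSize)
{
    std::ifstream in(inPath, std::ios::binary);
    std::vector<char> out(X * Y * valueSize);      // whole output, ~1 GB
    std::vector<char> block(X * valueSize);

    for (std::size_t y = 0; y < Y; ++y)
    {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));  // block y: V1.y .. VX.y
        for (std::size_t x = 0; x < X; ++x)                                  // V(x).y -> position x*Y + y
            std::copy(block.begin() + x * valueSize,
                      block.begin() + (x + 1) * valueSize,
                      out.begin() + (x * Y + y) * valueSize);
    }

    std::ofstream(outPath, std::ios::binary)
        .write(out.data(), static_cast<std::streamsize>(out.size()));
}
```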

If this is infeasible, it's good to realize that writing small bits of output is really, really bad. Changing 4 bytes means reading a whole cluster and writing it all back. Hence, you want to write as large a chunk as possible.

Say you choose a 64 kB chunk size. You know the 1 GB output holds 16384 such output chunks. You therefore read the input file 16384 times, on each pass extracting the relevant values from the input destined for that particular output chunk.

Clearly the "1GB at a time" approach is just the limit case of choosing a huge chunk so you have just one pass. The most efficient approach is thus to grab the biggest possible memory block. Divide the input size by the size of that block to get the number of passes, and repeatedly read the input.
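A sketch of the multi-pass, chunked variant under the same fixed-size-value assumption (chunk size and names are illustrative):

```cpp
#include <algorithm>
#include <fstream>
#include <vector>

// Chunked multi-pass sketch: one full pass over the input per output chunk.
// chunkBytes is the output chunk size (e.g. 64 kB) and is assumed to be a
// multiple of valueSize.
void transposeChunked(const char* inPath, const char* outPath,
                      std::size_t X, std::size_t Y,
                      std::size_t valueSize, std::size_t chunkBytes)
{
    const std::size_t totalValues = X * Y;
    const std::size_t chunkValues = chunkBytes / valueSize;

    std::ofstream out(outPath, std::ios::binary);
    std::vector<char> chunk(chunkBytes);
    std::vector<char> block(X * valueSize);

    for (std::size_t chunkStart = 0; chunkStart < totalValues; chunkStart += chunkValues)
    {
        const std::size_t chunkEnd = std::min(chunkStart + chunkValues, totalValues);

        std::ifstream in(inPath, std::ios::binary);     // one full pass per output chunk
        for (std::size_t y = 0; y < Y; ++y)
        {
            in.read(block.data(), static_cast<std::streamsize>(block.size()));
            for (std::size_t x = 0; x < X; ++x)
            {
                const std::size_t dest = x * Y + y;      // transposed position of V(x).y
                if (dest >= chunkStart && dest < chunkEnd)
                    std::copy(block.begin() + x * valueSize,
                              block.begin() + (x + 1) * valueSize,
                              chunk.begin() + (dest - chunkStart) * valueSize);
            }
        }
        out.write(chunk.data(),
                  static_cast<std::streamsize>((chunkEnd - chunkStart) * valueSize));
    }
}
```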

MSalters

You could use external sorting.

These algorithms are specifically designed for exactly this: sorting (a.k.a. your rearranging) a file whose content does not fit in memory.

You should search for a library implementation of such an algorithm. Software recommendations are not on-topic on this site.
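Without recommending any particular library, the idea can be sketched by hand: pair each value with its transposed destination index, write sorted runs that fit in memory, then merge the runs. A rough sketch, assuming 4-byte values (run size and file names are illustrative; a real implementation would read whole blocks rather than single values):

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

// A value tagged with its destination index in the transposed output.
// The struct (including any padding) is written and read back as-is.
struct Tagged { std::uint64_t dest; std::uint32_t value; };

void externalTranspose(const char* inPath, const char* outPath,
                       std::size_t X, std::size_t Y, std::size_t runValues)
{
    std::ifstream in(inPath, std::ios::binary);
    std::vector<std::string> runNames;
    std::vector<Tagged> run;
    run.reserve(runValues);

    // Phase 1: tag each value with its transposed position and write sorted runs.
    for (std::size_t y = 0; y < Y; ++y)
        for (std::size_t x = 0; x < X; ++x)
        {
            Tagged t;
            t.dest = static_cast<std::uint64_t>(x) * Y + y;   // transposed position of V(x).y
            in.read(reinterpret_cast<char*>(&t.value), sizeof t.value);
            run.push_back(t);
            if (run.size() == runValues || (y == Y - 1 && x == X - 1))
            {
                std::sort(run.begin(), run.end(),
                          [](const Tagged& a, const Tagged& b) { return a.dest < b.dest; });
                runNames.push_back("run_" + std::to_string(runNames.size()) + ".bin");
                std::ofstream(runNames.back(), std::ios::binary)
                    .write(reinterpret_cast<const char*>(run.data()),
                           static_cast<std::streamsize>(run.size() * sizeof(Tagged)));
                run.clear();
            }
        }

    // Phase 2: k-way merge of the sorted runs, emitting only the values in order.
    struct Head { Tagged t; std::size_t runIdx; };
    auto cmp = [](const Head& a, const Head& b) { return a.t.dest > b.t.dest; };
    std::priority_queue<Head, std::vector<Head>, decltype(cmp)> heap(cmp);

    std::vector<std::ifstream> runs;
    for (const std::string& name : runNames)
        runs.emplace_back(name, std::ios::binary);
    for (std::size_t i = 0; i < runs.size(); ++i)
    {
        Head h{{}, i};
        if (runs[i].read(reinterpret_cast<char*>(&h.t), sizeof h.t))
            heap.push(h);
    }

    std::ofstream out(outPath, std::ios::binary);
    while (!heap.empty())
    {
        Head h = heap.top();
        heap.pop();
        out.write(reinterpret_cast<const char*>(&h.t.value), sizeof h.t.value);
        if (runs[h.runIdx].read(reinterpret_cast<char*>(&h.t), sizeof h.t))
            heap.push(h);
    }
}
```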

bolov

You can try modifying your algorithm like this:

Instead of having one file per value, you can have a file for, let's say, 10 values. Now you have 10 times fewer files. What is left is to sort each of these files. Depending on their size, you might sort them in RAM, or you can create 10 files for each value and concatenate them.
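A sketch of this grouping scheme, in the variant where each group file is small enough to be transposed in RAM (fixed-size values assumed; the group size G and file names are illustrative):

```cpp
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Grouping sketch: one temporary file per group of G values instead of one
// per value, then each (much smaller) group file is transposed in RAM.
void transposeViaGroups(const char* inPath, const char* outPath,
                        std::size_t X, std::size_t Y,
                        std::size_t valueSize, std::size_t G)
{
    const std::size_t groups = (X + G - 1) / G;

    // Pass 1: distribute each block over one file per group of G values.
    {
        std::ifstream in(inPath, std::ios::binary);
        std::vector<FILE*> temp(groups);
        for (std::size_t g = 0; g < groups; ++g)
            temp[g] = std::fopen(("group_" + std::to_string(g) + ".bin").c_str(), "wb");

        std::vector<char> block(X * valueSize);
        while (in.read(block.data(), static_cast<std::streamsize>(block.size())))
            for (std::size_t g = 0; g < groups; ++g)
            {
                const std::size_t first = g * G;
                const std::size_t count = std::min(G, X - first);
                std::fwrite(block.data() + first * valueSize, valueSize, count, temp[g]);
            }
        for (FILE* f : temp)
            std::fclose(f);
    }

    // Pass 2: each group file holds G interleaved columns and is small enough
    // to transpose in RAM before appending it to the output.
    std::ofstream out(outPath, std::ios::binary);
    for (std::size_t g = 0; g < groups; ++g)
    {
        const std::size_t first = g * G;
        const std::size_t count = std::min(G, X - first);

        std::ifstream in("group_" + std::to_string(g) + ".bin", std::ios::binary);
        std::vector<char> data(count * Y * valueSize);
        in.read(data.data(), static_cast<std::streamsize>(data.size()));

        std::vector<char> sorted(data.size());
        for (std::size_t y = 0; y < Y; ++y)
            for (std::size_t c = 0; c < count; ++c)
                std::copy(data.begin() + (y * count + c) * valueSize,
                          data.begin() + (y * count + c + 1) * valueSize,
                          sorted.begin() + (c * Y + y) * valueSize);
        out.write(sorted.data(), static_cast<std::streamsize>(sorted.size()));
    }
}
```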

bolov