
I am working with large binary files (approx. 2 GB each) that contain raw data. These files have a well-defined structure: each file is an array of events, and each event is an array of data banks. Each event and each data bank has its own structure (header, data type, etc.).

From these files, all I have to do is extract whatever data I need and then analyze and play with it. I might not need all of the data; sometimes I extract just the XType data, other times just the YType data, etc.

I don't want to shoot myself in the foot, so I am asking for guidance/best practice on how to deal with this. I can think of 2 possibilities:

Option 1

  • Define a DataBank class; it holds the actual data (std::vector<T>) plus whatever structure a bank has.
  • Define an Event class; it holds a std::vector<DataBank> plus whatever structure an event has.
  • Define a MyFile class; it holds a std::vector<Event> plus whatever structure the file has.

The constructor of MyFile will take a std::string (the name of the file) and do all the heavy lifting of reading the binary file into the classes above.

Then, whatever I need from the binary file is just a method of the MyFile class; I can loop through Events, loop through DataBanks, and everything I could need is already in this "unpacked" object.
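
Roughly, the class layout I have in mind looks like this (XData and the header fields are just placeholders, not the real format):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Placeholder for the type of data I actually analyze (e.g. XType samples).
struct XData { /* ... */ };

class DataBank {
public:
    std::uint32_t type;              // bank type from the bank header
    std::vector<std::uint8_t> data;  // raw bank payload
};

class Event {
public:
    std::uint64_t id;                // event header fields
    std::vector<DataBank> banks;
};

class MyFile {
public:
    explicit MyFile(const std::string& filename);  // reads/unpacks the whole file
    std::vector<XData> getXData() const;           // walks events and banks
    std::size_t getNumEvents() const { return events_.size(); }
private:
    std::vector<Event> events_;
};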

The workflow here would look like this:

int main() {
    MyFile data_file("data.bin");
    std::vector<XData> my_data = data_file.getXData();
    // Play with my_data, and never again use the data_file object
    // ...
    return 0;
}

Option 2

  • Write free functions that take a std::string (the file name) as an argument and extract whatever I need from the file, e.g. std::vector<XData> getXData(std::string), int getNumEvents(std::string), etc. (a rough sketch follows below).
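
Roughly, such a function would look like this; the actual event/bank walking is only sketched in comments, since it depends on the file format:

#include <fstream>
#include <string>
#include <vector>

// Placeholder for the type of data I actually analyze.
struct XData { /* ... */ };

// Streams through the file and collects only the XType banks,
// without keeping the rest of the file in memory.
std::vector<XData> getXData(const std::string& filename)
{
    std::vector<XData> result;
    std::ifstream in(filename, std::ios::binary);
    // Format-specific part, in pseudocode:
    //   while (read next event header) {
    //       for (each bank in the event) {
    //           if (bank type == XType) decode the payload into result;
    //           else skip the payload, e.g. in.seekg(bank_size, std::ios::cur);
    //       }
    //   }
    return result;
}

// Same idea for the other queries, e.g.:
int getNumEvents(const std::string& filename);  // header-only pass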

The workflow here would look like this:

int main() {
    std::vector<XData> my_data = getXData("data.bin");
    // Play with my_data, and I didn't create a massive object
    // ...
    return 0;
}

Pros and Cons that I see

Option 1 seems like the cleaner option: I would "unpack" the binary file only once, in the MyFile constructor. But I will have created a huge object containing all the data from a 2 GB file, most of which I will never use. If I need to analyze 20 files (2 GB each), will I need 40 GB of RAM? I don't understand how such large objects are handled; will this affect performance?

Option 2 seems faster: I just extract whatever data I need and that's it; I don't "unpack" the entire binary file only to later pull out the data I care about, and I only create objects for the data I will actually play with. The problem is that I will have to deal with the binary file structure in every function; if that structure ever changes, it will be a pain to update.

As you can see from my question, I don't have much experience dealing with large structures and files. I appreciate any advice.

user17004502
  • Map the file into memory and use pointers to your structures. – Richard Critten Oct 18 '21 at 14:47
  • @RichardCritten `reinterpret_cast`'ing the mapped memory is UB, it should be `memcpy()`'ed on demand into actual objects. –  Oct 18 '21 at 14:48
  • Write the code and post it for review. I would write event-driven code (not the same thing as your events): event started, event ended, data bank start, data bank stop, header, type. Then adding code which extracts only what is needed should be quite easy. – Marek R Oct 18 '21 at 14:58
  • This really sounds like a job for a database. Store the info in there, and then you can have custom scripts you run against the DB to get the data sets you actually want to work with. – NathanOliver Oct 18 '21 at 14:59
  • @Frank this probably needs to be a separate question - if the file is written, closed, opened and then read back (memory mapped) - is the "object" still the original object such that `reinterpret_cast` is valid ? In other words does the C++ Object still persist in file storage ? – Richard Critten Oct 18 '21 at 14:59
  • @RichardCritten There is no "original" object to speak of. You can't in-place construct an object on top of memory and have it be initialized with the values at that storage in the first place. –  Oct 18 '21 at 15:09
  • @NathanOliver Each file contains the collected data of an experiment running for about 5 minutes. If the experiment runs for 20 minutes I have 4 files. The data in each file is not something permanent that I keep looking at. I do different analyses on a run, and move on to the next run. I might come back to a previous run later to analyze something else, but that is it. – user17004502 Oct 18 '21 at 15:15
  • If it looks like a database and functions like a database, you'll probably want to use a database. – Thomas Matthews Oct 18 '21 at 15:17
  • A long time ago, in the times of big box hard drives, files that were larger than the computer's memory (as most were) were processed in *chunks* or smaller portions. Merge sort is a classic example: read a big block of data into memory, process the data, then output. You could have one thread for each of these. – Thomas Matthews Oct 18 '21 at 15:20
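
For reference, a minimal sketch of the memory-mapping plus memcpy-on-demand approach from the first two comments, assuming a hypothetical EventHeader layout (the mapping itself would come from mmap, MapViewOfFile, or a library such as Boost.Interprocess, and this ignores endianness and padding concerns):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-layout event header as it appears in the file;
// the real field names and sizes depend on the actual format.
struct EventHeader {
    std::uint32_t event_id;
    std::uint32_t num_banks;
    std::uint64_t timestamp;
};

// Copy a header out of a memory-mapped (or otherwise loaded) byte buffer.
// memcpy into a trivially copyable struct avoids the undefined behaviour
// of reinterpret_cast'ing the mapped bytes directly.
EventHeader readHeader(const unsigned char* mapped, std::size_t offset)
{
    EventHeader h;
    std::memcpy(&h, mapped + offset, sizeof h);
    return h;
}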

1 Answer


I do not know whether the following scenario matches yours.

I had a case of processing huge log files of hardware signal logging in the automotive area: signals like door locked, radio on, temperature, and thousands more, some appearing periodically. The operator selects some signal types and then analyzes diagrams of the signal values.

This scenario is based on a huge log file that keeps growing as time passes.

What I did was create, for every signal type, its own log file extract in an optimized binary format (one would load a fixed-size byte[] array).

This meant that a diagram of just 10 selected types could be displayed fast, in real time: zooming in on a time interval, dynamically selecting signal types, and so on.
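
Applied to your case, a sketch of the idea in C++ (the XRecord layout and function names are just illustrative): a one-time pass over the raw file appends fixed-size records to one small file per type, and later analyses read only the small file they need.

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Illustrative fixed-size record written to a per-type extract file.
// Because the struct is trivially copyable, it can be written/read as raw bytes.
struct XRecord {
    std::uint64_t timestamp;
    double        value;
};

// One-time pass over the big raw file: append every XType sample to a small,
// fixed-record-size extract that later analyses can load quickly.
void appendToExtract(const std::string& extract_file, const XRecord& rec)
{
    std::ofstream out(extract_file, std::ios::binary | std::ios::app);
    out.write(reinterpret_cast<const char*>(&rec), sizeof rec);
}

// Later: load the whole extract (or a slice of it) in one go.
std::vector<XRecord> loadExtract(const std::string& extract_file)
{
    std::ifstream in(extract_file, std::ios::binary);
    std::vector<XRecord> records;
    XRecord rec;
    while (in.read(reinterpret_cast<char*>(&rec), sizeof rec))
        records.push_back(rec);
    return records;
}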

I hope you got some ideas.

Joop Eggen