
I am part of a group of experimental particle physicists currently tasked with developing new experimental methods to evaluate the validity of our theoretical collaborators' models using both simulated and actual event data. Typically, the process involves evaluating n-tuples containing f32 values (time, directional velocities, charge, momentum, magnetic field, etc.). Based on the tutorials I've read, I'll probably use an array in the code. Additionally, we are looking at possible alternatives to C++, the staple of the past; I have opted to learn and evaluate Rust while others have chosen different languages. We estimate the data size to be on the order of 1 PB for simulations.

My problem is determining the best approach for storing the n-tuples so they can be analyzed efficiently. Reading the Book, the chapter What is Ownership? touches on stack vs. heap access and the improved efficiency of keeping a value of known size on the stack rather than on the heap (the example uses String on the heap). Also, the Stack Overflow question linked in my code comments below notes that loops over arrays with 240 or more elements can be significantly slower (reportedly because LLVM stops fully unrolling them), so I should probably take this into consideration.


I wrote sample code with comments to assess the approach using an iterator (which appears to be faster than a standard range loop `for x in 0..N` and is recommended in the Book). The expectation was to see the sample array printed 29 times.

Is my approach sound or is there a more optimized recommendation?

I'm still going through the Book, but I try to keep my project in mind as I learn the language.


Sample Code:

fn main() {
//  Example 2D array of 8-tuples containing sample data
//  Question: Is it more efficient to have an array of a fixed size,
//  created at compile time and accessed on the stack, or to import the
//  data and store it on the heap? Consider the cost of reading files
//  into a stack-allocated vs. heap-allocated structure, and 1-D vs.
//  2-D array efficiency.
//  see https://stackoverflow.com/questions/57458460/why-is-there-a-large-performance-impact-when-looping-over-an-array-with-240-or-m
    let tuple = [0.0023, 5.4233, 3.3344, 4.3344, 10.0333, 3.2220, 4.2333, 7.4431];

    //  Populate sample 2-D array
    let arr: [[f32; 8]; 29] = [tuple; 29];
    
    //  create iterator to loop over the array
    let iter_arr = arr.iter();
    
    //  Loop over the tuples, apply statistical analysis, and store results.
    //  The Book says an iterator is more efficient, since an indexed range
    //  loop adds a bounds check on every iteration. Each event batch will
    //  contain no more than 200k tuples, so that overhead could add up.
    for x in iter_arr {
        //  create iterator for each inner array
        let iter_tuple = x.iter();
        
        //  Loop over the inner array
        for y in iter_tuple {
            //  perform analysis
            // store result
            //  send to graphical simulator: significant overhead
            //  store resulting image
            
            //  test
            print!("  {}", *y); 
        }   //  end for y
        println!();
    } // end for x
}

Output:

The output prints the 2-D array as expected, which implies this is a feasible solution; however, optimizing 2-D array access for speed is not covered in the Book (at least not yet).

  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431
  0.0023  5.4233  3.3344  4.3344  10.0333  3.222  4.2333  7.4431

Still working on learning how to format questions properly. Apologies in advance.

  • Unless you need to go back and forth between samples, I'd say don't bother with the array: load one tuple, process it, then load the next sample and so on. – Jmb Nov 05 '22 at 19:09
  • Anyway, as for all performance-related questions, the only answer is to benchmark in your setup with real data _in release mode_. For that you may find [the `criterion` crate](https://crates.io/crates/criterion) useful. – Jmb Nov 05 '22 at 19:11
  • @Jmb With each batch of events being ~200,000, would this cause a significant overhead retrieving just one at a time vs say 29? The buffer is full at 196k events. – thephysicsprogrammer Nov 05 '22 at 19:11
  • @Jmb that crate looks perfect for the benchmark. Thanks!! – thephysicsprogrammer Nov 05 '22 at 19:13
  • What form does this batch take? Is it a single file? One file per event? Or a request to a web server? – Jmb Nov 05 '22 at 19:14
  • @Jmb in the past, we have used a single file for storing the ~200k events. It's a file in the ROOT framework (see https://root.cern/ ). For the current approach, we intend to use a single systems language to run the entire experiment (analog-to-digital conversions, data storage, analysis, graphical simulation [OpenGL]) since we will also be using it to access custom hardware (thus the switch from ROOT). Storage methodology has yet to be addressed, as it is most likely language-specific for optimization purposes. – thephysicsprogrammer Nov 05 '22 at 19:20
  • If the data is in a single file and if each tuple is contiguous in that file, then reading them one by one will probably be faster than reading them in batch, provided you read the file through a [`BufReader`](https://doc.rust-lang.org/std/io/struct.BufReader.html). – Jmb Nov 05 '22 at 19:34
  • @Jmb going to work on the Criterion crate today and see what it produces. Thanks again. – thephysicsprogrammer Nov 06 '22 at 15:22

0 Answers