
Context: I've been processing scientific satellite images, currently keeping the individual end results at each timestamp as a cv::Mat_<double>. These can, for instance, be stored in a standard container of images, such as a std::vector<cv::Mat_<double>>.

The issue: I would now like to study the physical properties of each individual pixel over time. For that, it would be far preferable if I could look at the data along the time dimension and work with a 2D table of vectors instead. In other words: to have a std::vector<double> associated to each pixel on the 2D grid that is common to all images.

A reason for that is that the calculations involved (computing percentiles, curve fitting, etc.) will rely on standard algorithms and on libraries that expect to be fed std::vectors and the like. For a given pixel, though, the data is definitely not contiguous in memory along the time dimension.
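For concreteness, the straightforward copy I'm trying to avoid would look roughly like this (each image is modelled as a plain row-major std::vector<double> instead of a cv::Mat_<double>, just to keep the snippet self-contained):

```cpp
#include <cstddef>
#include <vector>

// Gather the time series of one pixel (x, y) from a stack of images.
// Each image is modelled here as a row-major std::vector<double> so the
// snippet stays self-contained; with OpenCV the read would be img(y, x)
// on a cv::Mat_<double>.
std::vector<double> pixel_series(const std::vector<std::vector<double>>& stack,
                                 std::size_t width,
                                 std::size_t x, std::size_t y)
{
    std::vector<double> series;
    series.reserve(stack.size());              // one sample per timestamp
    for (const auto& img : stack)
        series.push_back(img[y * width + x]);  // strided read, contiguous write
    return series;
}
```

The result is an ordinary std::vector<double> that can be handed directly to std::sort, std::nth_element, or a fitting routine.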

Can/Should I really avoid copying the data in such a case? If yes, what would be the best approach, then? By best I mean efficient yet as 'clean'/'clear' as possible.

I thought of using std::reference_wrapper to store the addresses in a std::vector; it's simple and works, but each entry takes as much memory as if I had simply duplicated the data in a std::vector<double>. Each data point is just a double, after all.

NB: I've stumbled upon Boost MultiArray, but I'd like to avoid having to add a Boost dependency.

Many thanks in advance for your time/input.

Alex
  • The standard algorithms library targets iterators, not containers. Have you considered writing custom iterators for the original data? – Captain Giraffe May 27 '20 at 16:20
    Why are you even trying to avoid a copy? A copy is as fast as it gets for your requirements. – Ext3h May 27 '20 at 16:25
  • @CaptainGiraffe Thank you very much for the idea. No, and I have actually never done that, so I'm not sure how hard it is; I'll definitely have a look. It might be a bit premature now, but interesting in the longer run. – Alex May 27 '20 at 17:24
  • @Ext3h I simply always try to question extra copies. But in this specific case I just couldn't find an alternative satisfactory approach. Hence this sanity check, in case I missed an extremely obvious solution. Thank you for your comment. I guess I've been playing a bit too much with python and got used to getting numpy arrays out of essentially everything. – Alex May 27 '20 at 17:24
  • The iterator requirements differ between algorithms, but begin(), end(), operator++ and operator* should be relatively straightforward. Should you be interested in examples, you can find a related discussion here: https://stackoverflow.com/q/3582608/451600 . Almost all algorithms don't require a complete iterator. – Captain Giraffe May 27 '20 at 19:28
  • For future reference and anyone who might also be interested: I've since learned that in the literature this discussion is actually called "AoS vs SoA" (Array of Structure vs Structure of Array), as seen on these slides about the Eigen library: http://downloads.tuxfamily.org/eigen/eigen_CGLibs_Giugno_Pisa_2013.pdf#page=94 – Alex Jun 19 '20 at 07:47
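A minimal sketch of the custom-iterator idea raised in the comments (again with plain row-major buffers standing in for cv::Mat_<double>; only the handful of operations mentioned above are provided):

```cpp
#include <cstddef>
#include <vector>

// Custom iterator over one pixel's time series, striding through a stack
// of images without copying any data. Only operator*, operator++ and
// operator!= are provided, which is already enough for simple algorithms
// such as std::accumulate.
class PixelIterator {
public:
    using Frame = std::vector<std::vector<double>>::const_iterator;

    PixelIterator(Frame frame, std::size_t offset)
        : frame_(frame), offset_(offset) {}

    double operator*() const { return (*frame_)[offset_]; }  // read the pixel
    PixelIterator& operator++() { ++frame_; return *this; }  // next timestamp
    bool operator!=(const PixelIterator& o) const { return frame_ != o.frame_; }

private:
    Frame frame_;         // current timestamp
    std::size_t offset_;  // fixed pixel index: y * width + x
};
```

For pixel (x, y) in images of a given width, iterating from PixelIterator(stack.cbegin(), y * width + x) to PixelIterator(stack.cend(), y * width + x) walks its time series in place. Fully generic algorithms would additionally need iterator traits (value_type, iterator_category, and so on).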

1 Answer


You could try something like std::views::transform (or its precursors, range-v3 and the Boost range adaptors), with a function object to look up each pixel:

[x, y](cv::Mat_<double> & mat) -> double & { return mat[y][x]; }

However, you should definitely profile whether that is worthwhile versus copying, as I expect the cache locality to be horrible.

Caleth
  • Very interesting, thank you. I had never looked into Ranges and views in C++20, as I'm using older compilers. I actually did not know that this was called locality either, but it indeed feels like it's going to be pretty bad. I think I'll just accept dealing with copies now, especially since it's obviously going to be a much simpler approach. Out of curiosity, I've had a quick look at numpy, where (even more) extreme/arbitrary slicing is frequent (as in pandas and xarray), and they actually simply return a copy in some cases. – Alex May 28 '20 at 14:31