C++ choice of pass by value vs pass by reference for POD math structure classes for high performance applications considering cache coherency

Question

For many high performance applications, such as game engines or financial software, considerations of cache coherency, memory layout, and cache misses are crucial for maintaining smooth performance. As the C++ standard has evolved, especially with the introduction of move semantics in C++11 and the further changes in C++14, it has become less clear where to draw the line between pass by value and pass by reference for mathematical POD based classes.

Consider the common POD Vector3 class:

using float32 = float; // assumed alias: float32 is not a standard C++ type

class Vector3
{
public:
   float32 x;
   float32 y;
   float32 z;
   // Implementation Functions below (all non-virtual)...
};

This is the most commonly used math structure in game development. It is a non-virtual, 12-byte class, even on 64-bit platforms, since we are explicitly using IEEE float32, which takes 4 bytes per float. My question is as follows: what is the general best practice guideline to use when deciding to pass POD mathematical classes by value or by reference for high performance applications?

Some things for consideration when answering this question:

It is safe to assume the default constructor does not initialize any values
It is safe to assume no arrays beyond 1D are used for any POD math structures
Clearly most people pass 4-8 byte POD constants by value, so there doesn't seem to be much debate there
What happens when this Vector is a class member variable vs a local variable on the stack? If pass by reference (PBR) is used, then it would use the memory address of the member inside the class vs the memory address of something local on the stack. Does this use-case matter? Could this difference result in more cache misses when PBR is used?
What about the case where SIMD is used or not used?
What about move semantics compiler optimizations? I have noticed that after switching to C++14, the compiler will often use move semantics when chained function calls pass the same vector by value, especially when it is const. I observed this by inspecting the generated assembly
When using pass by value and pass by reference with these math structures, does const have much impact on compiler optimizations? See the point above

Given the above, what is a good guideline for when to use pass by value vs pass by reference with modern C++ compilers (C++14 and above) to minimize cache misses and promote cache coherency? At what point might someone say a POD math structure is too large for pass by value, such as a 4x4 affine transform matrix, which is 64 bytes in size assuming use of float32? Does it matter, when making this decision, whether the Vector (or rather any small POD math structure) is declared on the stack or referenced as a member variable?
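
To make the comparison concrete, the two calling conventions under discussion could be sketched as follows. This is only an illustration: `float32` is assumed to be an alias for `float`, and `Matrix4`, `dot_by_value`, and `dot_by_ref` are hypothetical names that are not part of any real code base.

```cpp
#include <type_traits>

using float32 = float;  // assumption: alias for a 4-byte IEEE 754 float

struct Vector3 { float32 x, y, z; };
struct Matrix4 { float32 m[16]; };  // hypothetical 4x4 transform matrix

// The sizes quoted in the question, checked at compile time.
static_assert(sizeof(float32) == 4, "float32 is expected to be 4 bytes");
static_assert(sizeof(Vector3) == 12, "three floats, no padding");
static_assert(sizeof(Matrix4) == 64, "sixteen floats");

// Both types are trivially copyable, so "moving" one is the same
// operation as copying it; there is no cheaper move to fall back on.
static_assert(std::is_trivially_copyable<Vector3>::value, "Vector3 is POD-like");
static_assert(std::is_trivially_copyable<Matrix4>::value, "Matrix4 is POD-like");

// Pass by value: the callee works on its own copies of a and b.
inline float32 dot_by_value(Vector3 a, Vector3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Pass by const reference: the callee reads the caller's objects.
inline float32 dot_by_ref(const Vector3& a, const Vector3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

Both functions return the same result for the same inputs. Once a call is inlined the two forms usually compile to the same code, so any difference would be expected mainly across call boundaries that are not inlined, which is something to confirm by measuring on the target compiler.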

I am hoping someone can provide some analysis and insight so that a good modern guideline for best practices can be established for the above situation. I believe the line between PBV and PBR for POD classes has become blurrier as the C++ standard has evolved, especially in regard to minimizing cache misses.

Others will provide greater detail, but the bottom line, passing by reference does not require a copy or assignment to the local passed as a parameter, you simply pass a reference to the original. If nothing but speed mattered, I would likely pass a pointer (address of) the original since you have nothing more than three `float32` members and a few normal functions, that then allows simple access with the `->` operator.. — David C. Rankin, Sep 08 '20 at 00:41
@DavidC.Rankin Yes that is the classic trade-off. Anytime -> is used with a ptr or . with a reference, it is dereferencing that memory location to retrieve the value. The speed for which this is achieved depends on how close that memory is to the CPU registers. When using PBV, that memory is on the stack and much more likely to be closer to the CPU. The complexity here is that now compilers optimize with move semantics with modern C++. I don't think there is a difference between pass by pointer and pass by reference in terms of dereferencing it (without a nullptr check for the ptr). — Paul Renton, Sep 08 '20 at 00:46
Yep, that sums it up, and also why I chose to defer to the assembly and hardware folks here that will be more familiar with how this is optimized by the current field of compilers -- which unfortunately isn't something I keep up with in detail. I look forward to learning from a good answer as well. — David C. Rankin, Sep 08 '20 at 00:54
@DavidC.Rankin, but what if dereferencing is slower than copying? I think it's safer to say that the bottom line is to measure. If nothing but speed matters, then still measure. In fact always measure first, right? — SO_fix_the_vote_sorting_bug, Sep 12 '20 at 12:33

score 1 · Accepted Answer · answered Sep 08 '20 at 05:26

I see the question title is on the choice of pass-by-value vs. pass-by-reference, though it sounds like what you are after more broadly is the best practice for efficiently passing around 3D vectors and other common PODs. Passing data is fundamental and intertwined with the programming paradigm, so there isn't a consensus on the best way to do it. Besides performance, there are considerations to weigh like code readability, flexibility, and portability when deciding which approach to favor in a given application.

That said, in recent years, "data-oriented design" has become a popular alternative to object-oriented programming, especially in video game development. The essential idea is to think about the program in terms of data it needs to process, and how all that data can be organized in memory for good cache locality and computation performance. There was a great talk about it at CppCon 2014: "Data-Oriented Design and C++" by Mike Acton.

With your Vector3 example for instance, it is often the case that a program has not just one but many 3D vectors that are all processed the same way, say, all undergo the same geometric transformation. Data-oriented design suggests it is then a good idea to lay the vectors out contiguously in memory and transform them all together in a batch operation. This improves cache utilization and creates opportunities to leverage SIMD instructions. You could implement this example with the Eigen C++ linear algebra library. The vectors can be represented using an Eigen::Matrix<float, 3, Eigen::Dynamic> of shape 3xN to store N vectors, then manipulated using Eigen's SIMD-accelerated operations.
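
As a rough, dependency-free sketch of the batching idea (plain C++ rather than Eigen), the vectors can be stored one array per component and transformed in a single loop. The names `Vector3Batch` and `scale_translate_all` are hypothetical. Note also that this structure-of-arrays layout is an assumption of the sketch and is not the same memory layout as a column-major 3xN Eigen matrix.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays container: each component of all
// N vectors is stored contiguously in its own array.
struct Vector3Batch
{
    std::vector<float> x, y, z;

    explicit Vector3Batch(std::size_t n) : x(n), y(n), z(n) {}

    std::size_t size() const { return x.size(); }
};

// Apply p' = s * p + t to every vector in the batch.
// The batch itself is passed by reference once; the per-vector data
// is then walked linearly, which is friendly to the cache and gives
// the compiler a simple loop it may be able to auto-vectorize.
inline void scale_translate_all(Vector3Batch& batch, float s,
                                float tx, float ty, float tz)
{
    const std::size_t n = batch.size();
    float* x = batch.x.data();
    float* y = batch.y.data();
    float* z = batch.z.data();

    for (std::size_t i = 0; i < n; ++i)
    {
        x[i] = s * x[i] + tx;
        y[i] = s * y[i] + ty;
        z[i] = s * z[i] + tz;
    }
}
```

With this design the by-value vs. by-reference question for individual vectors mostly disappears from the hot loop, since the function takes the whole batch and never passes a single `Vector3` across a call boundary. Whether the loop is actually vectorized depends on the compiler and flags, so it is worth checking the generated assembly.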
