
Is it known which of these two variants is faster, are they the same, or is it incorrect to compare them at all?

Vector test(Vector &vec)
{
 // return a modified vector, or write directly to vec,
 // or return nothing but still access vec
}

Vector test(Vector vec)
{
 // same, but the parameter is a copy instead of a reference
}

I am asking because I probably need to know this in order to write well-optimized code for a Direct3D game.

UPDATE: I am talking about XMVECTOR from xnamath.h (DirectX SDK): 16 bytes, 4 floats.

Loryan55
  • Premature optimization. Profile first, optimize later. Unless you've determined this to be a bottleneck, chances are optimizing this won't change a thing – Borgleader Jul 29 '13 at 22:38
  • Good advice, thanks. But I want to know for the future; it is just interesting. – Loryan55 Jul 29 '13 at 22:41
  • @Borgleader Partial quotation from D.E. Knuth. Look up what he actually said. This is an important aspect of program design, not 'premature optimisation'. I've seen projects fail because of an incorrect choice here. There are semantic aspects to this choice, not just performance aspects. – user207421 Jul 29 '13 at 22:48
  • @EJP The question is tagged performance & optimization. Also, I fail to see how taking a copy or a reference in this case will "make the project fail" – Borgleader Jul 29 '13 at 22:50
  • @Borgleader Believe me, it did. The IDE being used generated call-by-value with gay abandon, and it became impossible to know who should actually release the dynamic objects. Result: memory leak city. It all had to be redone. If the tool had known better, it wouldn't have happened. I do suggest you look up what Knuth actually said. Not all optimisations are premature. Another example is the choice of a database over a flat file. It would be idiotic to build with the flat file first and then test and measure. – user207421 Jul 29 '13 at 22:53
  • One thing that bugs me about this comparison is that the two versions have very different *semantics*. While it's sometimes okay to change semantics to make optimizations possible, it's not fair to compare two implementations that are not interchangeable solely on their performance. The first function modifies `x` when called like `test(x)`, the second doesn't (it modifies a local copy, which is rarely what you want). A consequence is that the first can't be called with a temporary (e.g. `test(transmogrify(a))`). –  Jul 29 '13 at 22:53
  • @Borgleader I have just profiled my D3D code (but I don't have a good profiler) and found that some functions which operate on matrices or vectors take some time (not much, really). I just decided to rewrite some stuff; I hope it will help. :P – Loryan55 Jul 29 '13 at 22:54
  • @Loryan55 I highly doubt it. You should be doing matrix and vector calculations on your GPU anyway – Bartek Banachewicz Jul 29 '13 at 22:56
  • @EJP Sounds like failure to apply some basic C++ functionality (so-called RAII or maybe even the rule of three). Of course having such incompetent programmers causes a project to fail. – R. Martinho Fernandes Jul 30 '13 at 18:19

5 Answers


This isn't the sort of thing that is useful to generalize about.

Googling for XMVECTOR, I get

typedef __m128 XMVECTOR;

Therefore despite being 16 bytes, it's all one SSE machine register, so you should certainly pass this sucker by value. Taking a reference to something in a register only risks forcing it onto the stack.
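
For illustration, a minimal sketch assuming the xnamath.h conventions, where FXMVECTOR is the header's typedef for XMVECTOR parameters intended to be passed by value in registers (ScaleAndOffset is a made-up function name, not part of the SDK):

#include <xnamath.h>

XMVECTOR ScaleAndOffset(FXMVECTOR v, FXMVECTOR offset)
{
    // Both arguments can arrive in XMM registers; no memory indirection is needed.
    XMVECTOR scaled = XMVectorScale(v, 2.0f);  // stays in a register
    return XMVectorAdd(scaled, offset);        // returned in a register as well
}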

EDIT: Even if you aren't using the above typedef, XMVECTOR may still be a special type treated differently by the compiler. Note the remarks about the Xbox platform. In any case, what I say below counts doubly:


Treating micro-optimization as idiomatic is the wrong approach. Micro-optimization starts at the machine code. The starting point here should be whatever machine instructions the profiler points at, because there are so many tiny bits and pieces in any program that you won't find the slow part just by intuition.

If you are just getting started on your first optimization project, you should research different profiling tools (which tell you what part of the program is slow) and familiarize yourself with one. Once you drill down enough, when you can't improve speed by adjusting what the source code says to do, you will have to begin analyzing machine instructions. This requires familiarizing yourself with the details of your CPU and its instruction set. Only then can you usefully begin adjusting trivial differences in how the source code says to do small things.

If you don't know much about how your CPU executes instructions, don't jump to optimizing that sort of thing. It's a complete waste of time, considering that the big fish are in the algorithm and overall structure of the program.

Potatoswatter
  • Finally someone ending useless disputes here. – Bartek Banachewicz Jul 29 '13 at 23:06
  • I don't use __m128. I use the other typedef (check line 225 of xnamath.h). By the way, I am looking for proof that __m128 is better than struct __vector4. – Loryan55 Jul 29 '13 at 23:11
  • @Loryan55 Are you familiar with the difference between SSE and scalar x86/x87? – Potatoswatter Jul 29 '13 at 23:19
  • No, I am not. I am trying to understand why it is better. I found a solution here: http://stackoverflow.com/questions/15753465/using-xmvector-from-directxmath-as-a-class-member-causes-a-crash-only-in-release for how to properly use __m128 as a class member (that was my problem and why I switched to __vector4). But I am still wondering about the actual difference. – Loryan55 Jul 29 '13 at 23:22
  • @Loryan55 SSE accesses a special, faster part of the CPU. You definitely want to use it if at all possible. Just Google for "SSE"; Intel has a lot of marketing material and documents that can describe and explain it better than I can. – Potatoswatter Jul 29 '13 at 23:26
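
To make the SSE point concrete, here is a tiny sketch using the raw intrinsics from <xmmintrin.h>, which is roughly what the __m128 typedef boils down to:

#include <xmmintrin.h>

__m128 AddFour(__m128 a, __m128 b)
{
    // A single addps instruction adds four packed floats at once,
    // instead of four separate scalar additions.
    return _mm_add_ps(a, b);
}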

Edit: See the bottom for specifics on a Vector that is 16 bytes long.

It is very likely that the first one is significantly faster if the vector has more than a few elements (or the elements are themselves quite large).

However, "the devil is in the detail" as they say. It's possible that, under some specific circumstances, the second case is indeed faster. That would be an exception rather than the rule, but it's still a possibility.

In the second case, the vector is being copied [unless the compiler can inline the code AND the compiler can realise what is going on, and remove the extra copy]. If the vector has 10000 elements, that's 10000 copies of whatever is in the vector.

In the first case, all that is passed from the calling function to the called function is a single pointer (the reference). On the other hand, since it's a reference, the generated code has to make one extra memory access to read the contents. So if the vector is very small, and the test function accesses the vec variable quite a few times, it is possible that the extra overhead of the indirection is "worse" than copying the contents.

If in doubt, benchmark the two solutions.

Make sure that the benchmark is representative: you can get it badly wrong by making the code 100x faster for 10k elements and then ending up 2x slower when the number of elements is less than 20, when the average size in real use is 11...
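
As an illustration only, here is a minimal sketch of such a benchmark. Vec4 is a stand-in for the small 16-byte Vector from the question, the iteration count is arbitrary, and an optimizing compiler may inline both calls and erase the difference entirely, which is part of the point:

#include <chrono>
#include <cstdio>

struct Vec4 { float x, y, z, w; };

static Vec4 testByValue(Vec4 v)      { v.x += 1.0f; return v; }
static Vec4 testByRef(const Vec4 &v) { Vec4 r = v; r.x += 1.0f; return r; }

int main()
{
    const int N = 100000000;
    Vec4 v = { 1.0f, 2.0f, 3.0f, 4.0f };
    float sink = 0.0f;  // consume the results so the loops are not optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) sink += testByValue(v).x;
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) sink += testByRef(v).x;
    auto t2 = std::chrono::steady_clock::now();

    std::printf("by value: %lld ms, by reference: %lld ms (sink=%f)\n",
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count(),
                sink);
    return 0;
}

Run it with the same optimization flags and a workload (element count, call pattern) that matches the real program, not just the case that flatters one variant.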

Edit: Since the question was updated to say that the Vector object is quite small (16 bytes, 4 floats), it is much less likely that there is a significant difference between the choices. On a 32-bit system, the pass-by-reference option is likely to still have a small benefit [but, as I said above, that is balanced against more complex access to the Vector contents]. On a 64-bit system, it's quite possible that passing two register-sized values is faster than passing a reference.

Again, benchmark under "normal" type loads.

Mats Petersson

A vector argument passed by reference would be faster, all the more so for a vector with many elements, because you simply avoid the time spent making a local copy.

Perfervor

You should always pass objects by reference except when you need to pass an address, for example if you also want to allow a null pointer. Passing objects by value implies:

  1. Copying
  2. Object slicing

Neither of which you want to happen.
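
A minimal sketch of the slicing problem in point 2 (Base and Derived are made-up types, not from the question):

#include <cstdio>

struct Base
{
    virtual const char *name() const { return "Base"; }
    virtual ~Base() {}
};

struct Derived : Base
{
    const char *name() const override { return "Derived"; }
};

static void byValue(Base b)      { std::printf("%s\n", b.name()); }  // slices: prints "Base"
static void byRef(const Base &b) { std::printf("%s\n", b.name()); }  // prints "Derived"

int main()
{
    Derived d;
    byValue(d);  // the Derived part is cut off when the copy is made
    byRef(d);    // dynamic dispatch still works through the reference
    return 0;
}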

user207421
  • RE 1. Not necessarily, with move semantics and compiler optimizations. Maybe 10 years ago, or on an embedded platform. And even then, for small-ish types the copy isn't necessarily slower. RE 2. only a concern if you expect and accommodate subclasses. For many classes, you don't and subclasses have more problems than slicing. –  Jul 29 '13 at 22:50
  • When you want to store the value of the object, it's better to pass by value and do the copy in the interface; you would copy anyway, and it's better to invoke the copy constructor outside and the move constructor inside (see the sketch after these comments). – Bartek Banachewicz Jul 29 '13 at 23:01
  • Also, passing by value doesn't really "imply" copying; if the argument is an rvalue, there will be no copy if the object is move-constructible. – Bartek Banachewicz Jul 29 '13 at 23:03
  • Yes, I think the same here, by the way. – Loryan55 Jul 29 '13 at 23:10
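
A sketch of the store-by-value idiom mentioned in the comments above, assuming C++11 move semantics (Mesh and setVertices are made-up names):

#include <utility>
#include <vector>

class Mesh
{
    std::vector<float> vertices_;
public:
    // Taking the parameter by value: an lvalue argument pays one copy here,
    // an rvalue argument pays none; either way the member is move-assigned.
    void setVertices(std::vector<float> v) { vertices_ = std::move(v); }
};

// Usage:
//   Mesh m;
//   std::vector<float> verts = ...;  // filled elsewhere
//   m.setVertices(verts);            // one copy, made at the call site
//   m.setVertices(std::move(verts)); // no copy, just two moves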

Premature optimizations are the root of all evil.

It's mostly premature optimization. It's also a micro-optimization. As such, it requires more knowledge about the Vector type and its intended usage, your compiler, and a lot of other factors.

These two aren't equivalent, either; the former won't accept rvalues and will allow the vector to be changed by the function. You should use const& to make them really comparable.
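
A sketch of that difference (Vector here is just a stand-in struct, and makeVector is a hypothetical factory):

struct Vector { float x, y, z, w; };

Vector makeVector() { return Vector{ 1.0f, 2.0f, 3.0f, 4.0f }; }  // hypothetical factory returning a temporary

void byNonConstRef(Vector &v)    { v.x += 1.0f; }  // first variant: can modify the caller's vector, rejects rvalues
void byValue(Vector v)           { v.x += 1.0f; }  // second variant: modifies only a local copy, accepts rvalues
void byConstRef(const Vector &v) { (void)v; }      // const&: no copy, no modification, accepts rvalues

int main()
{
    Vector v = { 1.0f, 2.0f, 3.0f, 4.0f };
    byNonConstRef(v);                // OK; v.x is now 2.0f
    byValue(makeVector());           // OK; only the temporary's copy is changed
    byConstRef(makeVector());        // OK; a temporary binds to const&
    // byNonConstRef(makeVector());  // error: a temporary cannot bind to Vector&
    return 0;
}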

You said that it's a D3D app; in that case (apart from precomputations), you really want to be doing vector and matrix calculations on your GPU. A simple profiler won't help with that; you need to profile both the CPU and the GPU code.

And as @Potatoswatter noticed, this is a type that your CPU handles better when it is passed by value than it would if you passed it by reference.

Bartek Banachewicz
  • Again this is a partial quote. What Knuth actually said was something like '99% of the time, premature optimisation is the root of all evil'. And not all optimizations are premature. This answer just begs the question. – user207421 Jul 29 '13 at 22:57
  • @EJP no, it doesn't. This question cannot be answered precisely even with knowledge of the Vector type, because I could prepare a scenario where one would win and another scenario where the other would, rendering any answer saying "x is faster" useless. In fact, I am currently working on a program that uses a lot of POD types passed by value, and this is *far* from being a bottleneck. And regarding the 1% of cases, it's mostly about the algorithm, not microoptimizations like this. – Bartek Banachewicz Jul 29 '13 at 22:59
  • I think the third or fourth sentence should be bolded, not the first. – Ben Voigt Jul 29 '13 at 23:00
  • Thank you for the answer. I am just getting started, generally; I'll remember your advice for the near future. – Loryan55 Jul 29 '13 at 23:05