
Question

When passing thread_group type objects to a device function, is there a preference for passing by reference vs passing by value?

  • Is one of them "correct"?
  • What are the differences between the two approaches?
  • When should each approach be preferred?

Examples

Similar examples in the programming-guide and the developer blog seem to handle this differently.

Programming Guide

__device__
int sum(const thread_block& g, int *x, int n) {
    // ...
    g.sync();
    return total;
}

Developer Blog

__device__
int sum(thread_block block, int *x, int n) {
    ...
    block.sync();
    ...
    return total;
}

Additional Info

The programming-guide also has this to say about constructing implicit groups:

Although you can create an implicit group anywhere in the code, it is dangerous to do so. Creating a handle for an implicit group is a collective operation—all threads in the group must participate. If the group was created in a conditional branch that not all threads reach, this can lead to deadlocks or data corruption. For this reason, it is recommended that you create a handle for the implicit group upfront (as early as possible, before any branching has occurred) and use that handle throughout the kernel. Group handles must be initialized at declaration time (there is no default constructor) for the same reason and copy-constructing them is discouraged.

Which would lead me to believe passing them by reference is preferred, but I will admit there is more than enough detail underlying the various cooperative groups that it's likely I've missed some nuance. Would passing by value be considered "copy-constructing" and therefore be discouraged?

I have not noticed any performance or result difference using either one, but I may simply not have tested the right edge case; or any undefined behavior may just be working out in a way that doesn't cause a visible problem.
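For reference, a minimal sketch of how I understand the guide's advice (the function name, body, and condition are my own, not taken from the documentation): create the handle upfront, before any branching, then pass it by const reference so nothing is copy-constructed.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical device function, modeled on the guide's sum() example:
// taking the group by const reference avoids any copy construction.
__device__ int block_sum(const cg::thread_block& g, int* x, int n);

__global__ void kernel(int* x, int n) {
    // Create the implicit-group handle upfront, before any branching occurs.
    cg::thread_block block = cg::this_thread_block();

    if (n > 0) {  // uniform across the block, so collective ops inside are safe
        int total = block_sum(block, x, n);  // pass by const reference
        if (block.thread_rank() == 0) x[0] = total;
    }
}
```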

John Mansell
  • You could always look at the PTX/SASS generated for your examples and see if there is any difference in the code generated between the two approaches. I would be very surprised if there was. – talonmies Jun 15 '23 at 01:03
  • I would also interpret "copy-constructing them is discouraged" as "pass them by reference". I haven't looked at the current implementation, but I remember that certain groups had e.g. shared memory buffers for their reduction algorithm or similar. Not sure if that could become a problem and how e.g. non-inlined device functions without LTO would fare as an edge case. – paleonix Jun 15 '23 at 16:17
  • @talonmies From what I can see in the PTX, the only difference is in the function signature. However, this was done on a simple function and a GTX 1060 (CC 6.1, CUDA 12.1, g++ 12.2), so some of the hardware-accelerated functions like cg::plus<> are not available; I don't know if it would differ on a card with a higher compute capability. I modeled the 'block_reduce' function but passed the block in to see any difference. – John Mansell Jun 18 '23 at 00:47

1 Answer


A few observations first:

  1. The blog post you refer to is from 2017, when the feature was previewed; the documentation is current. On that basis alone you should favour the const pass-by-reference idiom, because the source is newer.
  2. As you have demonstrated yourself, because CUDA uses a highly stripped-back implementation of the C++ object model, and the compiler aggressively inlines functions for performance, it is very unlikely that you would find real-world cases where the compiler generates different code for the two approaches.

As a result, I would opine that the const pass-by-reference version is what you should use, both from a C++ language correctness POV, and because the current documentation suggests you should. There are probably corner cases where someone, somewhere, sometime, got burned by copy construction in the pass-by-value version, but I suspect you would have to try very hard for that to happen. Caveat emptor and all of that….
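A minimal sketch of that idiom (the reduction body is illustrative only, not taken from the documentation, and assumes n is a power of two no larger than the block size):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// The recommended signature: group passed by const reference, never copied.
__device__ int sum(const cg::thread_block& g, int* x, int n) {
    int rank = g.thread_rank();
    // Simple tree reduction: fold the upper half onto the lower half.
    // n is uniform across the block, so every thread runs the same trip count.
    for (int stride = n / 2; stride > 0; stride /= 2) {
        if (rank < stride) x[rank] += x[rank + stride];
        g.sync();  // collective: reached by every thread in the block
    }
    return x[0];
}
```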

talonmies