11

I recently needed a lambda that captured multiple local variables by reference, so I made a test snippet to investigate its efficiency, and compiled it with -O3 using clang 3.6:

void do_something_with(void*);

void test()
{
    int a = 0, b = 0, c = 0;

    auto func = [&] () {
        a++;
        b++;
        c++;
    };

    do_something_with((void*)&func);
}

movl   $0x0,0x24(%rsp)
movl   $0x0,0x20(%rsp)
movl   $0x0,0x1c(%rsp)

lea    0x24(%rsp),%rax
mov    %rax,(%rsp)
lea    0x20(%rsp),%rax
mov    %rax,0x8(%rsp)
lea    0x1c(%rsp),%rax
mov    %rax,0x10(%rsp)

lea    (%rsp),%rdi
callq  ...

Clearly the lambda only needs the address of one of the variables, from which all the others could be obtained by relative addressing.

Instead, the compiler created a struct on the stack containing pointers to each local variable, and then passed the address of the struct to the lambda. It's much in the same way as if I had written:

int a = 0, b = 0, c = 0;

struct X
{
    int *pa, *pb, *pc;
};

X x = {&a, &b, &c};

auto func = [p = &x] () {
    (*p->pa)++;
    (*p->pb)++;
    (*p->pc)++;
};

This is inefficient for various reasons, but most worryingly because it could lead to heap-allocation if too many variables are captured.

My questions:

  1. The fact that both clang and gcc do this at -O3 makes me suspect that something in the standard actually forces closures to be implemented inefficiently. Is this the case?

  2. If so, then for what reasoning? It cannot be for binary compatibility of lambdas between compilers, because any code that knows about the type of the lambda is guaranteed to lie in the same translation unit.

  3. If not, then why is this optimisation missing from two major compilers?


EDIT:
Here is an example of the more efficient code that I would like to have seen from the compiler. This code uses less stack space, the lambda now only performs one pointer indirection instead of two, and the lambda's size does not grow in the number of captured variables:

struct X
{
    int a = 0, b = 0, c = 0;
} x;

auto func = [&x] () {
    x.a++;
    x.b++;
    x.c++;
};

movl   $0x0,0x8(%rsp)
movl   $0x0,0xc(%rsp)
movl   $0x0,0x10(%rsp)

lea    0x8(%rsp),%rax
mov    %rax,(%rsp)

lea    (%rsp),%rdi
callq  ...
PBS
  • 1,389
  • 11
  • 20
  • 2
    I cannot provide backing information from the standard atm, but according to http://en.cppreference.com/w/cpp/language/lambda ` For the entities that are captured by reference [...] it is unspecified if additional data members are declared in the closure type.` it seems that it's not forbidden to optimise that. – Daniel Jour Jul 09 '15 at 13:32
  • 1
    For what it's worth, if you write idiomatic C++ rather than relying on unspecified casts between function and object pointers, it'll probably be fine. ([Example](https://goo.gl/vLIqde)) – Kerrek SB Jul 09 '15 at 13:35
  • @KerrekSB: I knew someone would complain about that `void*` cast - I have to use the lambda to stop everything from being optimised away, but using std::function creates a lot of noise in the assembly output. Would `uintptr_t` be better? EDIT: in your example, everything was indeed constant-propagated and inlined away. – PBS Jul 09 '15 at 13:38
  • *"from which all the others could be obtained by relative addressing"*. How would that improve the performance? I believe it would be still the same thing. Could you post the idea with some code (to be generated by your imaginary compiler)? – Nawaz Jul 09 '15 at 13:39
  • 1
    So are you complaining that if you work really hard to "stop everything from being optimized away" then you end up with suboptimal output? – Kerrek SB Jul 09 '15 at 13:42
  • 2
    Does the C++ Standard force * to be inefficient? -> No. Does the C++ Standard allow undefined behaviour to be compiled to inefficient code? -> Yes. – rubenvb Jul 09 '15 at 13:43
  • @KerrekSB: In real life the lambda would be passed to a function possibly in another TU which takes a std::function - then the whole lambda could not be optimised away, but the suboptimal "allocate a struct of pointers" would remain. I am trying to simulate this in a minimal example. – PBS Jul 09 '15 at 13:56
  • 1
    what the compiler failed to optimize is not the lambda, lambda just does a translation to functor. Now if you declare a object on the stack and pass it to a opaque function expecting a `void*`, there is no optimization to be done. – yngccc Jul 09 '15 at 13:58
  • *"heap-allocation"* -- wut? How could heap allocation result because of this? – Yakk - Adam Nevraumont Jul 09 '15 at 14:18
  • @Yakk: No heap allocation will ever happen in this example, you are right. But if this lambda were ever passed to a `std::function`, then capturing four pointers rather than one could push it over the small-closure optimisation, causing a heap allocation - see http://stackoverflow.com/questions/12452022. – PBS Jul 09 '15 at 14:24
  • @ecatmur I disagree that this is a duplicate, this question is specifically asking what the standard requires which plays no part is the duplicate you link to. The seems like a large enough difference to me. – Shafik Yaghmour Jul 09 '15 at 14:30
  • @KerrekSB @rubenvb it's perfectly allowable to cast a pointer to an object of closure type to `void*`; closures are data objects, not functions. (You can't *do* much with it except inspect its first byte, as the closure type is unknown outside the enclosing function.) – ecatmur Jul 09 '15 at 14:31
  • @ShafikYaghmour this question specifically asks 3 questions; at least one of those is a duplicate. – ecatmur Jul 09 '15 at 14:32
  • @ecatmur: You're right, it's the address of the closure itself, not the converted function pointer. – Kerrek SB Jul 09 '15 at 14:34
  • 2
    @PBS a legitimate way to allow optimization while preventing optimizing away the whole lambda would be to pass a type-erased invoker: `A + (A -> B) => 0 + (0 -> B)`. For example: http://goo.gl/QnCwUK – ecatmur Jul 09 '15 at 14:36
  • @ecatmur it is split out into three questions but they are all follow from the main question which is what does the standard actually require and what is the rationale that it does/doesn't require that. Which is a different question then why does this happen. One is a language lawyer question, the other is not. I don't have an answer for the rationale yet at least, but it would be good if someone who does could add that answer. – Shafik Yaghmour Jul 09 '15 at 15:34

1 Answers1

6

It looks like unspecified behavior. The following paragraph from the C++14 draft standard: N3936 section 5.1.2 Lambda Expressions [expr.prim.lambda] makes me think this:

An entity is captured by reference if it is implicitly or explicitly captured but not captured by copy. It is unspecified whether additional unnamed non-static data members are declared in the closure type for entities captured by reference. [...]

which different for entities captured by copy:

Every id-expression within the compound-statement of a lambda-expression that is an odr-use (3.2) of an entity captured by copy is transformed into an access to the corresponding unnamed data member of the closure type.

Thanks to dyp for pointing out some relevant documents which I somehow missed. It looks like defect report 750: Implementation constraints on reference-only closure objects provides the rationale for the current wording, and it says:

Consider an example like:

void f(vector<double> vec) {
  double x, y, z;
  fancy_algorithm(vec, [&]() { /* use x, y, and z in various ways */ });
}

5.1.2 [expr.prim.lambda] paragraph 8 requires that the closure class for this lambda will have three reference members, and paragraph 12 requires that it be derived from std::reference_closure, implying two additional pointer members. Although 8.3.2 [dcl.ref] paragraph 4 allows a reference to be implemented without allocation of storage, current ABIs require that references be implemented as pointers. The practical effect of these requirements is that the closure object for this lambda expression will contain five pointers. If not for these requirements, however, it would be possible to implement the closure object as a single pointer to the stack frame, generating data accesses in the function-call operator as offsets relative to the frame pointer. The current specification is too tightly constrained.

which echos your exact points about allowing potential optimizations and was implemented as part of N2927 which includes the following:

The new wording no longer specifies any rewrite or closure members for "by reference" capture. Uses of entities captured "by reference" affect the original entities, and the mechanism to achieve this is left entirely to the implementation.

Shafik Yaghmour
  • 154,301
  • 39
  • 440
  • 740
  • 1
    See [CWG 753](http://www.open-std.org/JTC1/SC22/WG21/docs/cwg_defects.html#753), however also see [CWG 2011](http://www.open-std.org/JTC1/SC22/WG21/docs/cwg_active.html#2011) – dyp Jul 09 '15 at 16:49
  • @dyp Somehow I missed that in my searches, not sure how though. – Shafik Yaghmour Jul 09 '15 at 17:14
  • @dyp actually did you mean [dr 750](http://www.open-std.org/JTC1/SC22/WG21/docs/cwg_defects.html#750)? – Shafik Yaghmour Jul 09 '15 at 17:27
  • Yes :) I typically copy the URL from the browser's address bar, but it's sometimes not pointing to the DR I'm currently looking at. Would be nice if there were actual links in the HTML which one could copy.. maybe I should use cmeerw's database instead. – dyp Jul 09 '15 at 17:33
  • @dyp updated my answer. Yeah, a lot of what is available to the public could be better organized, it was painful writing my answer to [How does the standards committee indicate the status of a paper under consideration?](http://stackoverflow.com/q/31033130/1708801). I am sure it is out of lack of time rather than desire to improve it. – Shafik Yaghmour Jul 09 '15 at 17:38
  • Well I'm currently working on a way to improve access to the information in the github draft LaTeX repository; when I'm done with that, I'll probably start linking that info to other available information such as the issue lists. I just hope the upcoming copyright policy will be compatible with what I'm imagining. Its current lack prevents me from publishing intermediate results.. – dyp Jul 09 '15 at 18:06