3

Let's say I have a std::tuple<T...> and I want to get access to its nth element efficiently, where n is only known at runtime. Since the types T... are heterogeneous, all I can get is a void * and I'm OK with that. Here's what I've arrived at:

template <size_t ... Indexes, class Tuple>
void * get_element_pointer(std::index_sequence<Indexes...>, Tuple & t, size_t idx) {
    static size_t offsets[] = {(size_t)(void *)&std::get<Indexes>(t) - (size_t)(void *)(&t)...};
    return (void *)((size_t)(void *)(&t) + offsets[idx]);
}

You then call it like:

get_element_pointer(std::index_sequence_for<T...>{}, some_tuple, some_index);

The gist of this is to statically create a size_t array, offsets, that contains the offset of each tuple element. Then, at runtime, I can just look up the offset and add it to the passed tuple's address.

Two problems bug me with my solution:

  1. offsets is created the first time this function is called, and it's created based on the tuple instance passed at the time. I find this a bit weird. I could create a bogus temporary of type Tuple, but it might not be default constructible. Alternatively, I could cast a nullptr to a Tuple *, but then the std::get<Indexes>(*(Tuple *)(nullptr)) screams UB.
  2. The (size_t)(void *)(&t) and (void *)((size_t)(void *)(&t) + offsets[idx]) pointer juggling is the only way I could find that stops the compiler from giving me warnings. I know pointer conversions can be tricky and non-trivial when you have virtual functions etc. So I'm worried that I might be missing something.

Do you think my solution is acceptable? Can you think of a simpler solution with less pointer juggling?

enobayram
  • 4,650
  • 23
  • 36

3 Answers

3

Having looked at the solutions, I took your concerns about performance to heart and decided to see if we could do better.

Interestingly, my attempts to optimise with constexpr had varying results depending on the compiler.

I'll compare the output of gcc 5.3 and Apple clang here.

Here was my solution:

#include <utility>
#include <tuple>
#include <iostream>


template<class Tuple, size_t Index>
void* get_address(Tuple& t)
{
  return std::addressof(std::get<Index>(t));
}

template <size_t ... Indexes, class Tuple>
constexpr void* get_element_pointer(Tuple & t, 
                          size_t idx, 
                          std::index_sequence<Indexes...>) 
{
  using function_type = void* (*)(Tuple&); 
  function_type constexpr ptrs[] = 
  {
    &get_address<Tuple, Indexes>...
  };
  return ptrs[idx](t);
}


template<class Tuple>
__attribute__((noinline))
constexpr 
  void * get_element_pointer(Tuple& t, size_t index)
{
  return get_element_pointer(t, 
                             index, 
                             std::make_index_sequence<std::tuple_size<Tuple>::value>());
}

int main()
{
  std::tuple<int, int, int, int, int, int, int , int, int, int> x;
  x = std::make_tuple(4, 5, 6, 7, 8, 9, 10, 11, 12, 13);
  std::cout << *reinterpret_cast<int*>(get_element_pointer(x, 1)) << std::endl;
}

(compiled with -O2 -fomit-frame-pointer for clarity)

clang's solution was this:

__Z19get_element_pointerINSt3__15tupleIJiiiiiiiiiiEEEEPvRT_m:
    .align  4, 0x90
    leaq    __ZZ19get_element_pointerIJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7ELm8ELm9EENSt3__15tupleIJiiiiiiiiiiEEEEPvRT0_mNS0_16integer_sequenceImJXspT_EEEEE4ptrs(%rip), %rax
    jmpq    *(%rax,%rsi,8)          ## TAILCALL

which, as expected, refers to a compile-time-generated jump table:

__ZZ19get_element_pointerIJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7ELm8ELm9EENSt3__15tupleIJiiiiiiiiiiEEEEPvRT0_mNS0_16integer_sequenceImJXspT_EEEEE4ptrs:
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm0EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm1EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm2EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm3EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm4EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm5EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm6EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm7EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm8EEPvRT_
    .quad   __Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm9EEPvRT_

where each accessor function is trivial (one example shown):

__Z11get_addressINSt3__15tupleIJiiiiiiiiiiEEELm2EEPvRT_:
    leaq    8(%rdi), %rax
    retq

This was what I assumed the compiler would do, being "what I would do if I were writing machine code".

However, gcc seems to miss the opportunity to optimise the jump table, and builds it on the stack before using it!

void* get_element_pointer<std::tuple<int, int, int, int, int, int, int, int, int, int> >(std::tuple<int, int, int, int, int, int, int, int, int, int>&, unsigned long):
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 0ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -88(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 1ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -80(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 2ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -72(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 3ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -64(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 4ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -56(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 5ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -48(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 6ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -40(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 7ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -32(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 8ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -24(%rsp)
        movq    void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 9ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&), -16(%rsp)
        movq    -88(%rsp,%rsi,8), %rax
        jmp     *%rax

before calling a similar trivial accessor:

void* get_address<std::tuple<int, int, int, int, int, int, int, int, int, int>, 3ul>(std::tuple<int, int, int, int, int, int, int, int, int, int>&):
        leaq    24(%rdi), %rax
        ret

So, undeterred, I wondered whether constant folding in a non-constexpr implementation might do better:

template <size_t ... Indexes, class Tuple>
void* get_element_pointer(Tuple & t, 
                          size_t idx, 
                          std::index_sequence<Indexes...>) 
{
  using function_type = void* (*)(Tuple&); 
  function_type static const ptrs[] = 
  {
    &get_address<Tuple, Indexes>...
  };
  return ptrs[idx](t);
}

It turns out it did: on gcc I now get the same code that clang produced with the constexpr solution:

void* get_element_pointer<std::tuple<int, int, int, int, int, int, int, int, int, int> >(std::tuple<int, int, int, int, int, int, int, int, int, int>&, unsigned long):
        movq    void* get_element_pointer<0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, std::tuple<int, int, int, int, int, int, int, int, int, int> >(std::tuple<int, int, int, int, int, int, int, int, int, int>&, unsigned long, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul>)::ptrs(,%rsi,8), %rax
        jmp     *%rax

What did clang make of this?

__Z19get_element_pointerINSt3__15tupleIJiiiiiiiiiiEEEEPvRT_m:
    movq    __ZZ19get_element_pointerIJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7ELm8ELm9EENSt3__15tupleIJiiiiiiiiiiEEEEPvRT0_mNS0_16integer_sequenceImJXspT_EEEEE4ptrs@GOTPCREL(%rip), %rax
    jmpq    *(%rax,%rsi,8)          ## TAILCALL

Happily the same result.

So here's the final, provably optimal solution:

template<class Tuple, size_t Index>
void* get_address(Tuple& t)
{
    return std::addressof(std::get<Index>(t));
}

template <size_t ... Indexes, class Tuple>
void* get_element_pointer(Tuple & t,
                                    size_t idx,
                                    std::index_sequence<Indexes...>)
{
    using function_type = void* (*)(Tuple&);
    function_type static const ptrs[] =
    {
        &get_address<Tuple, Indexes>...
    };
    return ptrs[idx](t);
}


template<class Tuple>
__attribute__((noinline))
constexpr
void * get_element_pointer(Tuple& t, size_t index)
{
    return get_element_pointer(t,
                               index,
                               std::make_index_sequence<std::tuple_size<Tuple>::value>());
}
Richard Hodges
  • 68,278
  • 7
  • 90
  • 142
  • Wow, thanks for the detailed analysis! Your solution looks much nicer than mine, especially for avoiding pointer arithmetic. I'm accepting your answer as it's much cleaner, but I doubt it's optimal. Looking up a value in an offset table and adding it to the tuple's address would probably yield better performance than a jump table, if anything due to its effects on branch prediction. I'll probably end up using your solution though, since I don't need that last iota of performance. – enobayram Apr 22 '16 at 10:35
  • @enobayram I can't think how to build a compile-time solution that uses offsets because std::addressof is not a constant expression. Because of this, the lookup table gets built in code, with an atomic guard to protect the static constructor. – Richard Hodges Apr 22 '16 at 10:50
  • I agree, the problem is hard, but it's disappointing not to be able to express what's more efficient without resorting to UB. – enobayram Apr 22 '16 at 11:09
  • @enobayram Thinking it through, I actually don't think an offset lookup would be any more efficient. Since the arithmetic would be dependent on an indexed memory read you still have a pipeline stall - it's just that it's in the arithmetic pipeline rather than the instruction pipeline. I really don't think you'd be able to measure the difference between this solution and a UB-assisted index-offset solution. – Richard Hodges Apr 22 '16 at 11:20
  • Depends on what happens next, right? If the following instructions do something unrelated to the resulting pointer you're calculating, there might not be a stall at all. This is actually quite likely, since the compiler will probably inline this `get_element_pointer` call. – enobayram Apr 22 '16 at 12:13
  • @enobayram yes I did think about that. But then if we are in a situation in which performance matters at all (i.e. a tight loop) then the jump table will be in a hot cache I would have thought, so it may matter less than one would intuitively think... but what do i know? I think to find out we'd have to build a convoluted test harness and time the two results. I bet that whatever we did with the result of the lookup would take far longer than the lookup itself! :-)) – Richard Hodges Apr 22 '16 at 12:28
2

Why not just:

template <size_t ... Indexes, class Tuple>
void* get_element_pointer(std::index_sequence<Indexes...>, Tuple & t, size_t idx) {
    void* ptrs[] = { static_cast<void *>(std::addressof(std::get<Indexes>(t)))... };
    return ptrs[idx];
}

Note that I use std::addressof to handle evil classes with an overloaded operator &.

For your warnings, you should replace your std::size_t with std::intptr_t and/or char *:

static std::intptr_t offsets[] = {
    reinterpret_cast<char *>(std::addressof(std::get<Indexes>(t)))
    - reinterpret_cast<char *>(&t)...
};
return static_cast<void *>(reinterpret_cast<char *>(&t) + offsets[idx]);
Jarod42
  • 203,559
  • 14
  • 181
  • 302
  • I've actually considered this alternative, but I can't be sure that the compiler will generate efficient code in this case, since you're creating the array of pointers at each call. I'd really be impressed if a compiler were able to optimize this (think about what it has to do). Thanks for the heads up about `addressof`. As for using `char *` please check my comments below John Zwinck's answer. – enobayram Apr 22 '16 at 08:29
  • @enobayram I took your concerns to heart and did some tests. take a look at my answer. I think the results are interesting. – Richard Hodges Apr 22 '16 at 09:59
0
  1. Using the first-passed instance doesn't seem like a problem to me, in terms of correctness. You're right to point out that default constructibility is a problem if you try to create a tuple ahead of time, but then again you could cast nullptr to a tuple* and use that.

  2. Perhaps (void *)((size_t)(void *)(&t) + offsets[idx]) could be more simply written as reinterpret_cast<char*>(&t) + offsets[idx].

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • 1. `std::get(*(Tuple *)(nullptr))` really scares me off. 2. My first attempt was using `char *`, but I'm not sure what the guarantees are around pointer conversions through `char *`. I know that `(T *)` -> `(void *)` -> `(T *)` is guaranteed to be correct regardless of how messy a type `T` is. I'm not sure what `char *` promises on non-POD types though. – enobayram Apr 22 '16 at 06:20
  • Well `t` is a tuple, which is not some crazy polymorphic class, so maybe that alleviates your concern about `char*`? As for casting `nullptr`, well, C people may remember that's how we used to do `offsetof()`: http://stackoverflow.com/questions/7897877/how-does-the-c-offsetof-macro-work – John Zwinck Apr 22 '16 at 06:30
  • The trouble is not the tuple itself, you don't know in advance what the elements of the tuple are, they can be crazy polymorphic types. – enobayram Apr 22 '16 at 07:40