I am trying out some bitfield operations and benchmarking them following the info in this post. The code I am using is essentially the same and is shown below.

I've compiled the code with

❯ g++ bench.cpp -std=c++20 -march=native -O3 -o g++bench.out
❯ clang++ bench.cpp -std=c++20 -march=native -O3 -o clang++bench.out

Results:

❯ ./g++bench.out
operations on struct in memory
bitfields: 0.00443397
570425344
separate ints: 0.00320708
570425344
explicit and/or/shift: 0.0721971
570425344

operations on struct larger than memory
bitfields: 0.202714
570425344
separate ints: 0.127191
570425344
explicit and/or/shift: 0.102186
570425344

❯ ./clang++bench.out
operations on struct in memory
bitfields: 0.00304556
570425344
separate ints: 0.00291514
570425344
explicit and/or/shift: 0.00276303
570425344

operations on struct larger than memory
bitfields: 0.00350051
570425344
separate ints: 0.116294
570425344
explicit and/or/shift: 0.0909704
570425344

What mainly strikes me is that the clang bitfield code on the large vector is almost 30 times faster than the clang versions using separate ints or explicit and/or/shift, and 58 times faster than the g++-compiled bitfield version.

Since all three variants that operate on a single struct in memory run in about the same time, I suspect there is no special optimization of the operations themselves, but rather that clang is doing some clever memory fetching or loop unrolling.

Can anyone explain why the clang bitfield code is so fast in this case (or point out a bug in the benchmark, if that is all it is)?

I would also like to know whether the benchmark code can be adapted so that g++ gets the same speedup.

#include <time.h>
#include <iostream>
#include <vector>

// bitfield version: four fields packed into 1 + 5 + 2 + 8 = 16 bits
struct A
{
    void a(unsigned n) { a_ = n; }
    void b(unsigned n) { b_ = n; }
    void c(unsigned n) { c_ = n; }
    void d(unsigned n) { d_ = n; }
    unsigned a() { return a_; }
    unsigned b() { return b_; }
    unsigned c() { return c_; }
    unsigned d() { return d_; }
    unsigned a_:1,             
             b_:5,
             c_:2,
             d_:8;
};

// plain version: each field stored in its own unsigned int
struct B
{
    void a(unsigned n) { a_ = n; }
    void b(unsigned n) { b_ = n; }
    void c(unsigned n) { c_ = n; }
    void d(unsigned n) { d_ = n; }
    unsigned a() { return a_; }
    unsigned b() { return b_; }
    unsigned c() { return c_; }
    unsigned d() { return d_; }
    unsigned a_, b_, c_, d_;
};

// manual version: fields packed into one unsigned via explicit masks and shifts
struct C
{
    void a(unsigned n) { x_ &= ~0x01; x_ |= n; }
    void b(unsigned n) { x_ &= ~0x3E; x_ |= n << 1; }
    void c(unsigned n) { x_ &= ~0xC0; x_ |= n << 6; }
    void d(unsigned n) { x_ &= ~0xFF00; x_ |= n << 8; }
    unsigned a() const { return x_ & 0x01; }
    unsigned b() const { return (x_ & 0x3E) >> 1; }
    unsigned c() const { return (x_ & 0xC0) >> 6; }
    unsigned d() const { return (x_ & 0xFF00) >> 8; }
    unsigned x_;
};

struct Timer
{
    Timer() { get(&start_tp); }
    double elapsed() const {
        struct timespec end_tp;
        get(&end_tp);
        return (end_tp.tv_sec - start_tp.tv_sec) +
               (1E-9 * end_tp.tv_nsec - 1E-9 * start_tp.tv_nsec);
    }
  private:
    static void get(struct timespec* p_tp) {
        if (clock_gettime(CLOCK_REALTIME, p_tp) != 0)
        {
            std::cerr << "clock_gettime() error\n";
            exit(EXIT_FAILURE);
        }
    }
    struct timespec start_tp;
};

// run the field updates on a single struct that stays hot in cache
template <typename T>
unsigned f()
{
    int n = 0;
    Timer timer;
    T t;
    for (int i = 0; i < 1024*1024*32; ++i)
    {
        t.a(i & 0x01);
        t.b(i & 0x1F);
        t.c(i & 0x03);
        t.d(i & 0xFF);
        n += t.a() + t.b() + t.c() + t.d();
    }
    std::cout << timer.elapsed() << '\n';
    return n;
}

// run the field updates cycling through a vector too large to stay in cache
template <typename T>
unsigned g()
{
    int n = 0;
    Timer timer;
    std::vector<T> ts(1024 * 1024 * 16);
    for (size_t i = 0, idx = 0; i < 1024*1024*32; ++i)
    {
        T& t = ts[idx];
        t.a(i & 0x01);
        t.b(i & 0x1F);
        t.c(i & 0x03);
        t.d(i & 0xFF);
        n += t.a() + t.b() + t.c() + t.d();
        idx++;
        if (idx >= ts.size()) {
            idx = 0;
        }
    }
    std::cout << timer.elapsed() << '\n';
    return n;
}

int main()
{
    std::cout << "operations on struct in memory" << std::endl;
    std::cout << "bitfields: " << f<A>() << '\n';
    std::cout << "separate ints: " << f<B>() << '\n';
    std::cout << "explicit and/or/shift: " << f<C>() << '\n';
    std::cout << std::endl;

    std::cout << "operations on struct larger than memory" << std::endl;
    std::cout << "bitfields: " << g<A>() << '\n';
    std::cout << "separate ints: " << g<B>() << '\n';
    std::cout << "explicit and/or/shift: " << g<C>() << '\n';
    std::cout << std::endl;
}
ODON

1 Answer

Good question!

Looking at the Godbolt output for the two loops, it looks like clang has simply optimized better. The interesting part of clang's inner loop is that each field gets its own counter with its own increment:

  add     rcx, 1
  add     edx, 64
  add     r10d, 2
  add     r9d, 256

UPDATE: In C++ terms, this is roughly equivalent to

    size_t i_field1 = 0, i_field2 = 0, i_field3 = 0, i_field4 = 0;
    for (size_t i = 0, idx = 0;
         i < 1024*1024*32;
         i_field1 += 1, i_field2 += 2, i_field3 += 64, i_field4 += 256, ++i)
    {
        T& t = ts[idx];
        i_field1 &= 1;            // tracks  i & 0x01
        i_field2 &= (31 << 1);    // tracks (i & 0x1F) << 1
        i_field3 &= (3 << 6);     // tracks (i & 0x03) << 6
        i_field4 &= (255 << 8);   // tracks (i & 0xFF) << 8
        t = i_field1 + i_field2 + i_field3 + i_field4;  // disjoint bits: one packed store
        n += t;
        if (++idx >= ts.size()) {
            idx = 0;
        }
    }

It has allocated a counter for each of the bit fields and increments each of them every time around the loop. Each counter is then AND'ed with its field mask each iteration, which in assembly terms is about as fast as it gets. Both GCC and Clang use the fact that the bit fields add up to 16 bits, so the struct can be read and written as a single 16-bit unsigned number. The GCC code implements the source much more literally, with a single index that is shifted and masked for every field on every iteration.
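
To make the 16-bit trick concrete, here is a minimal hand-written sketch (mine, not the compiler's output, and the helper name store_all is made up) of what writing the whole struct as a single 16-bit word looks like, assuming the field layout of struct A:

    #include <cstdint>

    // Pack a_ (bit 0), b_ (bits 1-5), c_ (bits 6-7) and d_ (bits 8-15) and
    // store them with one 16-bit write instead of four read-modify-write steps.
    inline void store_all(std::uint16_t& word,
                          unsigned a, unsigned b, unsigned c, unsigned d)
    {
        word = static_cast<std::uint16_t>( (a & 0x01)
                                         | ((b & 0x1F) << 1)
                                         | ((c & 0x03) << 6)
                                         | ((d & 0xFF) << 8));
    }

Because the four widths add up to exactly 16 bits, no neighbouring bits have to be preserved, so the store never needs to read the old value first.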

As with all optimizations that use more registers, it's not as simple as saying the clang code is better, since it uses more register space, but I'd say +1 for clang.

As for how to make GCC use the same trick, you could try hand-coding the clang transformation in C++ and see how the GCC output changes (a sketch follows below). However, how generically useful that would be is questionable.
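
A sketch of what that might look like, reusing struct C, Timer and the includes from the question code (the function name g_clang_style is made up, and whether GCC actually turns it into the same instructions clang emits would have to be checked on Godbolt):

    // Hand-coded version of clang's transformation: one pre-shifted counter per
    // field and a single packed store per element, instead of four masked writes.
    unsigned g_clang_style()
    {
        unsigned n = 0;
        Timer timer;
        std::vector<C> ts(1024 * 1024 * 16);
        unsigned f1 = 0, f2 = 0, f3 = 0, f4 = 0;   // pre-shifted per-field counters
        for (size_t i = 0, idx = 0; i < 1024*1024*32; ++i)
        {
            C& t = ts[idx];
            t.x_ = f1 | f2 | f3 | f4;              // write all four fields at once
            n += t.a() + t.b() + t.c() + t.d();
            // advance each counter to its value for the next i:
            // f1 == i & 0x01, f2 == (i & 0x1F) << 1, f3 == (i & 0x03) << 6, f4 == (i & 0xFF) << 8
            f1 = (f1 + 1)   & 0x01;
            f2 = (f2 + 2)   & (31 << 1);
            f3 = (f3 + 64)  & (3 << 6);
            f4 = (f4 + 256) & (255 << 8);
            if (++idx >= ts.size()) idx = 0;
        }
        std::cout << timer.elapsed() << '\n';
        return n;
    }

Running this alongside g<C>() in the existing main() should show whether GCC closes the gap when the strength reduction is written out by hand.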

UPDATE:
This means that clang is simply better at optimizing here than GCC: it shifts the increment for each field, rather than incrementing one counter and shifting it for each field. Clang has essentially understood the intent of the code and implemented it in a different way that produces the same result, which is something compilers are allowed to do.

Tiger4Hire
  • My assembly skills are nonexistent but I'll take this as "program worked ok but doesn't do what programmer intended". I changed the program so it would initialize the values in the bitfields from an array of random numbers instead of straight from the loop counter. Would this make sense? The clang bitfield version is still 2 times faster than the g++ bitfield version and also 3 times faster than the clang int version – ODON Sep 28 '21 at 13:11
  • Ah, I will edit my answer to make it clearer. Thanks for the feedback – Tiger4Hire Sep 28 '21 at 14:04
  • Appreciate the update. This makes it much clearer for me. I'm looking into bitfields as I'd like to build a neural network where there will be a whole class (and huge number) of neurons only needing 16 "states". The idea is to pack the 4 bits needed for this into a larger struct, so speed matters for me. Of course input will not be from a simple counter. I'm also trying this in OpenCL, which sadly doesn't support bitfields, so I'm looking to improve the explicit and/or/shift code. Will study your answer to see if I can reap some good ideas for this. – ODON Sep 28 '21 at 15:04
  • Understood, your approach seems a good one to me. I have used bit-fields for many network/games applications, and they work well. Keeping your number of bits to a power of two will help keep the code fast. There are gotchas to be careful of, though: the packing order may differ between compilers, and beware signed fields of size one (the valid values are 0/-1, not 0/1, as you might expect). – Tiger4Hire Sep 28 '21 at 15:13