5

I'm trying to compute the bit parity of a large number of uint64's. By bit parity I mean a function that accepts a uint64 and outputs 0 if the number of set bits is even, and 1 otherwise.

Currently I'm using the following function (by @Troyseph, found here):

uint parity64(uint64 n){
  n ^= n >> 1;
  n ^= n >> 2;
  n = (n & 0x1111111111111111) * 0x1111111111111111;
  return (n >> 60) & 1;
}

The same SO page has the following assembly routine (by @papadp):

.code

; bool CheckParity(size_t Result)
    CheckParity PROC
    mov     rax, 0
    add     rcx, 0
    jnp     jmp_over
    mov     rax, 1
jmp_over:
    ret
CheckParity ENDP

END

which takes advantage of the machine's parity flag. But I cannot get it to work with my C program (I know next to no assembly).

Question. How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?

(I'm using GCC with 64-bit Ubuntu 14 on an Intel Xeon Haswell)


In case it's of any help, the parity64() function is called inside the following routine:

uint bindot(uint64* a, uint64* b, uint64 entries){
    uint parity = 0;

    for(uint i=0; i<entries; ++i)
      parity ^= parity64(a[i] & b[i]);  // Running sum!

    return parity;
}

(This is supposed to be the "dot product" of two vectors over the field Z/2Z, aka. GF(2).)

Peter Cordes
étale-cohomology

4 Answers

10

This may sound a bit harsh, but I believe it needs to be said. Please don't take it personally; I don't mean it as an insult, especially since you already admitted that you "know next to no assembly." But if you think code like this:

CheckParity PROC
    mov     rax, 0
    add     rcx, 0
    jnp     jmp_over
    mov     rax, 1
 jmp_over:
    ret
CheckParity ENDP

will beat what a C compiler generates, then you really have no business using inline assembly. In just those 5 lines of code, I see 2 instructions that are glaringly sub-optimal. It could be optimized by just rewriting it slightly:

   xor     eax, eax
   test    ecx, ecx   ; logically, should use RCX, but see below for behavior of PF
   jnp     jmp_over
   mov     eax, 1     ; or possibly even "inc eax"; would need to verify
jmp_over:
   ret

Or, if you have random input values that are likely to foil the branch predictor (i.e., there is no predictable pattern to the parity of the input values), then it would be faster yet to remove the branch, writing it as:

xor     eax, eax
test    ecx, ecx
setp    al
ret

Or perhaps the equivalent (which will be faster on certain processors, but not necessarily all):

xor     eax, eax
test    ecx, ecx
mov     ecx, 1
cmovp   eax, ecx
ret

And these are just the improvements I could see off the top of my head, given my existing knowledge of the x86 ISA and previous benchmarks that I have conducted. But lest anyone be fooled, this is undoubtedly not the fastest code, because (borrowing from Michael Abrash), "there ain't no such thing as the fastest code"—someone can virtually always make it faster yet.

There are enough problems with using inline assembly when you're an expert assembly-language programmer and a wizard when it comes to the intricacies of the x86 ISA. Optimizers are pretty darn good nowadays, which means it's hard enough for a true guru to produce better code (though certainly not impossible). It also takes trustworthy benchmarks that will verify your assumptions and confirm that your optimized inline assembly is actually faster. Never commit yourself to using inline assembly to outsmart the compiler's optimizer without running a good benchmark. I see no evidence in your question that you've done anything like this. I'm speculating here, but it looks like you saw that the code was written in assembly and assumed that meant it would be faster. That is rarely the case. C compilers ultimately emit assembly language code, too, and it is often more optimal than what we humans are capable of producing, given a finite amount of time and resources, much less limited expertise.

In this particular case, there is a notion that inline assembly will be faster than the C compiler's output, since the C compiler won't be able to intelligently use the x86 architecture's built-in parity flag (PF) to its benefit. And you might be right, but it's a pretty shaky assumption, far from universally true. As I've said, optimizing compilers are pretty smart nowadays, and they do optimize to a particular architecture (assuming you specify the right options), so it would not surprise me at all if an optimizer emitted code that used PF. You'd have to look at the disassembly to see for sure.

As an example of what I mean, consider the highly specialized BSWAP instruction that x86 provides. You might naïvely think that inline assembly would be required to take advantage of it, but it isn't. The following C code compiles to a BSWAP instruction on almost all major compilers:

uint32 SwapBytes(uint32 x)
{
    return ((x << 24) & 0xff000000 ) |
           ((x <<  8) & 0x00ff0000 ) |
           ((x >>  8) & 0x0000ff00 ) |
           ((x >> 24) & 0x000000ff );
}

The performance will be equivalent, if not better, because the optimizer has more knowledge about what the code does. In fact, a major benefit this form has over inline assembly is that the compiler can perform constant folding with this code (i.e., when called with a compile-time constant). Plus, the code is more readable (at least, to a C programmer), much less error-prone, and considerably easier to maintain than if you'd used inline assembly. Oh, and did I mention it's reasonably portable if you ever wanted to target an architecture other than x86?
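
For instance, a small (hypothetical) caller with a compile-time constant typically ends up with no run-time byte swap at all once the optimizer folds it:

uint32 SwapExample(void)
{
    /* With optimizations on, this call is typically folded to the
       constant 0x78563412 at compile time, so no BSWAP executes. */
    return SwapBytes(0x12345678);
}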

I know I'm making a big deal of this, and I want you to understand that I say this as someone who enjoys the challenge of writing highly-tuned assembly code that beats the compiler's optimizer in performance. But every time I do it, it's just that: a challenge, which comes with sacrifices. It isn't a panacea, and you need to remember to check your assumptions, including:

  • Is this code actually a bottleneck in my application, such that optimizing it would even make any perceptible difference?
  • Is the optimizer actually emitting sub-optimal machine language instructions for the code that I have written?
  • Am I wrong in what I naïvely think is sub-optimal? Maybe the optimizer knows more than I do about the target architecture, and what looks like slow or sub-optimal code is actually faster. (Remember that less code is not necessarily faster.)
  • Have I tested it in a meaningful, real-world benchmark, and proven that the compiler-generated code is slow and that my inline assembly is actually faster?
  • Is there absolutely no way that I can tweak the C code to persuade the optimizer to emit better machine code that is close, equal to, or even superior to the performance of my inline assembly?

In an attempt to answer some of these questions, I set up a little benchmark. (Using MSVC, because that's what I have handy; if you're targeting GCC, it's best to use that compiler, but we can still get a general idea. I use and recommend Google's benchmarking library.) And I immediately ran into problems. See, I first run my benchmarks in "debugging" mode, with assertions compiled in that verify that my "tweaked"/"optimized" code is actually producing the same results for all test cases as the original code (that is presumably known to be working/correct). In this case, an assertion immediately fired. It turns out that the CheckParity routine written in assembly language does not return identical results to the parity64 routine written in C! Uh-oh. Well, that's another bullet we need to add to the above list:

  • Have I ensured that my "optimized" code is returning the correct results?

This one is especially critical, because it's easy to make something faster if you also make it wrong. :-) I jest, but not entirely, because I've done this many times in the pursuit of faster code.
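
To make that kind of cross-check concrete, here is a rough sketch of the idea (the function names are placeholders, not the actual harness I used): run the candidate against the known-good reference over a batch of pseudo-random inputs, and only benchmark once the assertions pass.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

extern unsigned int parity64_reference(uint64_t n);   /* known-good implementation */
extern unsigned int parity64_candidate(uint64_t n);   /* "optimized" implementation */

static void ValidateCandidate(void)
{
   for (int i = 0; i < 1000000; ++i)
   {
      /* crude pseudo-random 64-bit test values */
      uint64_t n = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
      assert(parity64_reference(n) == parity64_candidate(n));
   }
}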

I believe Michael Petch has already pointed out the reason for the discrepancy: in the x86 implementation, the parity flag (PF) only concerns itself with the bits in the low byte, not the entire value. If that's all you need, then great. But even then, we can go back to the C code and further optimize it to do less work, which will make it faster—perhaps faster than the assembly code, eliminating the one advantage that inline assembly ever had.
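
For example, if only the low byte mattered, a reduced version might look something like this (a sketch only, not one of the functions I benchmarked below):

unsigned int parity_low_byte(uint64 n)
{
   n &= 0xff;        /* only the low byte is relevant in this scenario */
   n ^= n >> 4;
   n ^= n >> 2;
   n ^= n >> 1;
   return (unsigned int)(n & 1);
}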

For now, let's assume that you need the parity of the full value, since that's the original implementation you had that was working, and you're just trying to make it faster without changing its behavior. Thus, we need to fix the assembly code's logic before we can even proceed with meaningfully benchmarking it. Fortunately, since I am writing this answer late, Ajay Brahmakshatriya (with collaboration from others) has already done that work, saving me the extra effort.

…except, not quite. When I first drafted this answer, my benchmark revealed that draft 9 of his "tweaked" code still did not produce the same result as the original C function, so it's unsuitable according to our test cases. You say in a comment that his code "works" for you, which means either (A) the original C code was doing extra work, making it needlessly slow, meaning that you can probably tweak it to beat the inline assembly at its own game, or worse, (B) you have insufficient test cases and the new "optimized" code is actually a bug lying in wait. Since that time, Ped7g suggested a couple of fixes, which both fixed the bug causing the incorrect result to be returned, and further improved the code. The amount of input required here, and the number of drafts that he has gone through, should serve as testament to the difficulty of writing correct inline assembly to beat the compiler. But we're not even done yet! His inline assembly remains incorrectly written. SETcc instructions require an 8-bit register as their operand, but his code doesn't use a register specifier to request that, meaning that the code either won't compile (because Clang is smart enough to detect this error) or will compile on GCC but won't execute properly because that instruction has an invalid operand.

Have I convinced you about the importance of testing yet? I'll take it on faith, and move on to the benchmarking part. The benchmark results use the final draft of Ajay's code, with Ped7g's improvements, and my additional tweaks. I also compare some of the other solutions from that question you linked, modified for 64-bit integers, plus a couple of my own invention. Here are my benchmark results (mobile Haswell i7-4850HQ):

Benchmark                         Time          CPU      Iterations
-------------------------------------------------------------------
Naive                            36 ns         36 ns       19478261
OriginalCCode                     4 ns          4 ns      194782609
Ajay_Brahmakshatriya_Tweaked      4 ns          4 ns      194782609
Shreyas_Shivalkar                37 ns         37 ns       17920000
TypeIA                            5 ns          5 ns      154482759
TypeIA_Tweaked                    4 ns          4 ns      160000000
has_even_parity                 227 ns        229 ns        3200000
has_even_parity_Tweaked          36 ns         36 ns       19478261
GCC_builtin_parityll              4 ns          4 ns      186666667
PopCount                          3 ns          3 ns      248888889
PopCount_Downlevel                5 ns          5 ns      100000000

Now, keep in mind that these are for randomly-generated 64-bit input values, which disrupts branch prediction. If your input values are biased in a predictable way, either towards parity or non-parity, then the branch predictor will work for you, rather than against you, and certain approaches may be faster. This underscores the importance of benchmarking against data that simulates real-world use cases. (That said, when I write general library functions, I tend to optimize for random inputs, balancing size and speed.)

Notice how the original C function compares to the others. I'm going to make the claim that optimizing it any further is probably a big fat waste of time. So hopefully you learned something more general from this answer, rather than just scrolled down to copy-paste the code snippets. :-)

The Naive function is a completely unoptimized sanity check to determine the parity, taken from here. I used it both to validate your original C code and to provide a baseline for the benchmarks. Since it loops through each bit, one-by-one, it is relatively slow, as expected:

unsigned int Naive(uint64 n)
{
   bool parity = false;
   while (n)
   {
      parity = !parity;
      n &= (n - 1);
   }
   return parity;
}

OriginalCCode is exactly what it sounds like—it's the original C code that you had, as shown in the question. Notice how it posts up at exactly the same time as the tweaked/corrected version of Ajay Brahmakshatriya's inline assembly code! Now, since I ran this benchmark in MSVC, which doesn't support inline assembly for 64-bit builds, I had to use an external assembly module containing the function, and call it from there, which introduced some additional overhead. With GCC's inline assembly, the compiler probably would have been able to inline the code, thus eliding a function call. So on GCC, you might see the inline-assembly version be up to a nanosecond faster (or maybe not). Is that worth it? You be the judge. For reference, this is the code I tested for Ajay_Brahmakshatriya_Tweaked:

Ajay_Brahmakshatriya_Tweaked PROC
    mov    rax, rcx   ; Windows 64-bit calling convention passes the parameter in RCX (System V passes it in RDI)
    shr    rax, 32
    xor    rcx, rax
    mov    rax, rcx
    shr    rax, 16
    xor    rcx, rax
    mov    rax, rcx
    shr    rax, 8
    xor    eax, ecx   ; Ped7g's TEST is redundant; XOR already sets PF
    setnp  al
    movzx  eax, al
    ret
Ajay_Brahmakshatriya_Tweaked ENDP

The function named Shreyas_Shivalkar is from his answer here, which is just a variation on the loop-through-each-bit theme, and is, in keeping with expectations, slow:

Shreyas_Shivalkar PROC
   ; unsigned int parity = 0;
   ; while (x != 0)
   ; {
   ;    parity ^= x;
   ;    x     >>= 1;
   ; }
   ; return (parity & 0x1);
   xor     eax, eax
   test    rcx, rcx
   je      SHORT Finished
Process:
   xor     eax, ecx
   shr     rcx, 1
   jne     SHORT Process
Finished:
   and     eax, 1
   ret
Shreyas_Shivalkar ENDP

TypeIA and TypeIA_Tweaked are the code from this answer, modified to support 64-bit values, and my tweaked version. They parallelize the operation, resulting in a significant speed improvement over the loop-through-each-bit strategy. The "tweaked" version is based on an optimization originally suggested by Mathew Hendry to Sean Eron Anderson's Bit Twiddling Hacks, and does net us a tiny speed-up over the original.

unsigned int TypeIA(uint64 n)
{
   n ^= n >> 32;
   n ^= n >> 16;
   n ^= n >> 8;
   n ^= n >> 4;
   n ^= n >> 2;
   n ^= n >> 1;
   return !((~n) & 1);
}

unsigned int TypeIA_Tweaked(uint64 n)
{
   n ^= n >> 32;
   n ^= n >> 16;
   n ^= n >> 8;
   n ^= n >> 4;
   n &= 0xf;
   return ((0x6996 >> n) & 1);
}

has_even_parity is based on the accepted answer to that question, modified to support 64-bit values. I knew this would be slow, since it's yet another loop-through-each-bit strategy, but obviously someone thought it was a good approach. It's interesting to see just how slow it actually is, even compared to what I termed the "naïve" approach, which does essentially the same thing, but faster, with less-complicated code.

unsigned int has_even_parity(uint64 n)
{
   uint64 count = 0;
   uint64 b     = 1;
   for (uint64 i = 0; i < 64; ++i)
   {
      if (n & (b << i)) { ++count; }
   }
   return (count % 2);
}

has_even_parity_Tweaked is an alternate version of the above that saves a branch by taking advantage of the fact that Boolean values are implicitly convertible into 0 and 1. It is substantially faster than the original, clocking in at a time comparable to the "naïve" approach:

unsigned int has_even_parity_Tweaked(uint64 n)
{
   uint64 count = 0;
   uint64 b     = 1;
   for (uint64 i = 0; i < 64; ++i)
   {
      count += static_cast<int>(static_cast<bool>(n & (b << i)));
   }
   return (count % 2);
}

Now we get into the good stuff. The function GCC_builtin_parityll consists of the assembly code that GCC would emit if you used its __builtin_parityll intrinsic. Several others have suggested that you use this intrinsic, and I must echo their endorsement. Its performance is on par with the best we've seen so far, and it has a couple of additional advantages: (1) it keeps the code simple and readable (simpler than the C version); (2) it is portable to different architectures, and can be expected to remain fast there, too; (3) as GCC improves its implementation, your code may get faster with a simple recompile. You get all the benefits of inline assembly, without any of the drawbacks.

GCC_builtin_parityll PROC     ; GCC's __builtin_parityll
    mov    edx, ecx
    shr    rcx, 32
    xor    edx, ecx
    mov    eax, edx
    shr    edx, 16
    xor    eax, edx
    xor    al, ah
    setnp  al
    movzx  eax, al
    ret
GCC_builtin_parityll ENDP

PopCount is an optimized implementation of my own invention. To come up with this, I went back and considered what we were actually trying to do. The parity is simply whether the number of set bits is even or odd. Therefore, it can be calculated by counting the number of set bits and testing whether that count is even or odd. That's two logical operations. As luck would have it, on recent generations of x86 processors (Intel Nehalem or AMD Barcelona, and newer), there is an instruction that counts the number of set bits—POPCNT (population count, or Hamming weight)—which allows us to write assembly code that does this in two operations.

(Okay, actually three instructions, because there is a bug in the implementation of POPCNT on certain microarchitectures that creates a false dependency on its destination register, and to ensure we get maximum throughput from the code, we need to break this dependency by pre-clearing the destination register. Fortunately, this is a very cheap operation, one that can generally be handled for "free" by register renaming.)

PopCount PROC
    xor     eax, eax   ; break false dependency
    popcnt  rax, rcx
    and     eax, 1
    ret
PopCount ENDP

In fact, as it turns out, GCC knows to emit exactly this code for the __builtin_parityll intrinsic when you target a microarchitecture that supports POPCNT (otherwise, it uses the fallback implementation shown above). As you can see from the benchmarks, this is the fastest code yet. It isn't a major difference, so it's unlikely to matter unless you're doing this repeatedly within a tight loop, but it is a measurable difference and presumably you wouldn't be optimizing this so heavily unless your profiler indicated that this was a hot-spot.
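
The plain-C equivalent of the PopCount routine above (a sketch; assuming GCC or Clang, compiled with -mpopcnt or -march=native so a real POPCNT instruction is emitted) would be along these lines:

unsigned int PopCount_C(uint64 n)
{
   /* The low bit of the population count is the parity. Without POPCNT
      support, the compiler falls back to a library routine or a
      bit-twiddling sequence. */
   return (unsigned int)(__builtin_popcountll(n) & 1);
}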

But the POPCNT instruction does have the drawback of not being available on older processors, so I also measured a "fallback" version of the code that does a population count with a sequence of universally-supported instructions. That is the PopCount_Downlevel function, taken from my private library, originally adapted from this answer and other sources.

PopCount_Downlevel PROC
    mov     rax, rcx
    shr     rax, 1
    mov     rdx, 5555555555555555h
    and     rax, rdx
    sub     rcx, rax
    mov     rax, 3333333333333333h
    mov     rdx, rcx
    and     rcx, rax
    shr     rdx, 2
    and     rdx, rax
    add     rdx, rcx
    mov     rcx, 0F0F0F0F0F0F0F0Fh
    mov     rax, rdx
    shr     rax, 4
    add     rax, rdx
    mov     rdx, 0101010101010101h
    and     rax, rcx
    imul    rax, rdx
    shr     rax, 56
    and     eax, 1
    ret
PopCount_Downlevel ENDP

As you can see from the benchmarks, all of the bit-twiddling instructions that are required here exact a cost in performance. It is slower than POPCNT, but supported on all systems and still reasonably quick. If you needed a bit count anyway, this would be the best solution, especially since it can be written in pure C without the need to resort to inline assembly, potentially yielding even more speed:

unsigned int PopCount_Downlevel(uint64 n)
{
    uint64 temp = n - ((n >> 1) & 0x5555555555555555ULL);
    temp        = (temp & 0x3333333333333333ULL) + ((temp >> 2) & 0x3333333333333333ULL);
    temp        = (temp + (temp >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    temp        = (temp * 0x0101010101010101ULL) >> 56;
    return (temp & 1);
}

But run your own benchmarks to see if you wouldn't be better off with one of the other implementations, like OriginalCCode, which simplifies the operation and thus requires fewer total instructions. Fun fact: Intel's compiler (ICC) always uses a population count-based algorithm to implement __builtin_parityll; it emits a POPCNT instruction if the target architecture supports it, or otherwise, it simulates it using essentially the same code as I've shown here.

Or, better yet, just forget the whole complicated mess and let your compiler deal with it. That's what built-ins are for, and there's one for precisely this purpose.

Cody Gray - on strike
  • Sheeze, that is one heck of an answer, full of solid advice that only comes from the school of experience. Well done. – David C. Rankin May 12 '17 at 04:34
  • The analysis is amazing. We need more of such answers on SO. I am taking notes! – Ajay Brahmakshatriya May 12 '17 at 04:47
  • Such a beautiful answer..! Some notes: It makes no sense to assume "inline assembly" is faster, since eventually everything is; it's just a question of whether one can "out-assembly" the compiler. While impossible in general, sometimes the programmer simply has more information about the purpose of a piece of code than what the compiler can infer from the code itself (not saying that's the case here). (I mean, compilers are clever, but they're far from being general-intelligence agents). In this case, though, I was mostly excited about accessing stuff that C doesn't expose, ie. the parity flag – étale-cohomology May 12 '17 at 12:51
  • Great answer! I think it could be improved by moving your empirical analysis to a more prominent location (e.g. the top of the answer)... You could provide an introduction such as *"Contrary to the implicit assertion that inlined assembly is faster than optimised C code, I've discovered that this isn't necessarily the case. The analysis shown below proves, empirically, that the version provided in C is *just as fast* as the inline assembly version."* Then, once you've reeled the reader in with the figures, introduce the idea that *premature optimisation is misguided*. Nonetheless, well done! – autistic May 13 '17 at 03:30
9

Because C sucks when handling bit operations, I suggest using gcc's built-in functions, in this case __builtin_parityl(). See:

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
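
For a 64-bit value, a minimal sketch using the long long variant might look like this:

#include <stdint.h>

/* __builtin_parityll() returns 1 if its unsigned long long argument has
   an odd number of set bits, and 0 if the count is even. */
static inline unsigned int parity64(uint64_t n)
{
    return __builtin_parityll(n);
}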

hdante
1

You will have to use extended inline assembly (which is a gcc extension) to get a similar effect.

Your parity64 function can be changed as follows -

uint parity64_unsafe_and_broken(uint64 n){
    uint result = 0;
    __asm__("addq $0, %0" : : "r"(n)  :);
   // editor's note: compiler-generated instructions here can destroy EFLAGS
   // Don't depend on FLAGS / regs surviving between asm statements
   // also, jumping out of an asm statement safely requires asm goto
    __asm__("jnp 1f");
    __asm__("movl $1, %0" : "=r"(result) : : );
    __asm__("1:");
    return result;
}

But as commented by @MichaelPetch, the parity flag is computed only on the lower 8 bits. So this will work for you only if n is less than 256. For bigger numbers you will have to use the code you mentioned in your question.

To get it working for 64 bits you can collapse the parity of the 64-bit integer into a single byte by doing

n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;

This code has to go right at the start of the function, before the assembly.

You will have to check how it affects the performance.

The most optimized version I could get is:

uint parity64(uint64 n){
    unsigned char result = 0;
    n = (n >> 32) ^ n;
    n = (n >> 16) ^ n;
    n = (n >> 8) ^ n;
    __asm__("test %1, %1 \n\t"
            "setp %0"
            : "+r"(result)
            : "r"(n)
            :
    );
    return result;
}
Peter Cordes
Ajay Brahmakshatriya
  • 4
    Problem here is that the parity flag on x86/x86-64 is only based on the low byte of the result. – Michael Petch May 10 '17 at 04:26
  • 1
    Also, as suggested by @MichaelPetch, the parity flag will only look at the least significant byte. It won't work for `uint64`. It will only give the parity of the lower 8 bits. – Ajay Brahmakshatriya May 10 '17 at 04:30
  • @AjayBrahmakshatriya Yes, it works now! I get a 20-30% performance boost. Thank you! – étale-cohomology May 10 '17 at 04:30
  • 2
    @étale-cohomology : it doesn't really work. If you want the parity of the entire 64-bit register you have to look at other solutions. – Michael Petch May 10 '17 at 04:31
  • Ah, and there's no easy way to check the parity for the full 8 bytes? – étale-cohomology May 10 '17 at 04:31
  • @MichaelPetch Is collapsing the parity into the lower byte using 3 xors a valid solution? – Ajay Brahmakshatriya May 10 '17 at 04:39
  • 3
    You don't need the jump, use setp instead – Benoît May 10 '17 at 04:40
  • @Benoît I was trying to keep the assembly as close as possible to what was mentioned in his question so he can relate to what is happening. – Ajay Brahmakshatriya May 10 '17 at 04:41
  • @Benoît How would I use `setp` instead of `jnp` in this case? – étale-cohomology May 10 '17 at 04:47
  • 1
    @étale-cohomology after add you can do `__asm__("setp %0" : "=r" (result) : : );` instead of the next assembly lines. – Ajay Brahmakshatriya May 10 '17 at 04:51
  • 2
    @AjayBrahmakshatriya I get `operand type mismatch for setp` – étale-cohomology May 10 '17 at 05:03
  • 1
    @étale-cohomology sorry I forgot to change the type of `result` in the example. I had changed it on my machine. Please try now. – Ajay Brahmakshatriya May 10 '17 at 05:13
  • 1
    @étale-cohomology also the `andq` should have been 15 instead of 16. I just changed that. Sorry for the bug. – Ajay Brahmakshatriya May 10 '17 at 05:20
  • @AjayBrahmakshatriya Thank you! It works now. I'm accepting this answer because I've learned a _great_ deal from you; however, as far as I can tell, GCC's `__builtin_parityl()` runs faster. (Maybe I'm doing something wrong?) – étale-cohomology May 10 '17 at 05:23
  • 3
    No, `__builtin_parityl()` is bound to work faster. It is written in the most optimized way depending on the target. – Ajay Brahmakshatriya May 10 '17 at 05:24
  • 1
    `n = n & 0xf;` should be `n = n & 0xff;` With `0xf` it will for example report `0x80` as parity even. And in asm `andq $255, ...`, which is actually not needed as long as you know the last `xor` was done by ordinary `xor` and flags remained, then only `setp` is needed. Which is not something you can guarantee with the compiler, so it's probably better to inline it all including `xor`. Of course the best idea is to use the built-in intrinsic function. It makes me feel a bit uneasy that the incorrect answer is marked as accepted... – Ped7g May 10 '17 at 10:09
  • @Ped7g Wow! How did I miss that. I will change the bytes. Or rather I will just remove the `and` and place a cheaper instruction like test with itself (I am not sure I know any cheaper instruction that does nothing but set the flags). Yes I agree that builtins will be definitely faster. I wrote this answer to demonstrate how he can replicate what he saw in another question. Also I tried to keep the assembly as similar as possible to the one in the question. (Hence the first one with jmps etc). – Ajay Brahmakshatriya May 10 '17 at 11:11
  • I don't get why you are folding the 64bit to 32bit when you could use testq. – Benoît May 10 '17 at 21:51
  • 1
    @Benoit well, OP needs a solution using PF. But the PF is set by the hardware only according to the parity of the LSB. Hence all the folding. – Ajay Brahmakshatriya May 11 '17 at 02:19
  • 1
    I had earlier noticed the same bug as étale regarding the `SETP` instruction, and mentioned it in my answer. Since I didn't post my answer immediately, I see now that you have fixed this bug. The code works now, but it is sub-optimal. You get better results if you keep the `result` variable the same size as the native registers, but force the compiler to use a byte-sized register for the `SETP` instruction with the appropriate [constraint](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html) for the output, which would be `+q` in this case, instead of the regular `+r`. – Cody Gray - on strike May 12 '17 at 04:32
  • 1
    You also technically don't even need the `TEST` instruction, because the compiler-emitted `XOR` instruction already sets the parity flag. Unfortunately, since you're allowing the optimizer to generate this code (better for performance), you can't really rely on which instruction will be emitted last, which sort of forces you into using the redundant `TEST`. It is a tricky balancing act, as usual, when dealing with inline asm. Also, it's likely that `CMOVP` would be faster than `SETP`. But as you already said, the built-in intrinsic is the best solution by far. – Cody Gray - on strike May 12 '17 at 04:35
  • @CodyGray, Ah, using `+q` really makes sense. Native size integers allow for other optimizations and using byte size subregisters allows use with setp. Yes, I realized that the `test` can be omitted, but as you mentioned, compiler behavior is unpredictable and may also vary across different compilers. Probably performing the last XOR in the same `__asm__` that does setp could solve the problem. Because definitely the compiler wouldn't insert any instruction inside one `__asm__` – Ajay Brahmakshatriya May 12 '17 at 04:39
  • Downvoted for totally unsafe first part of the answer. You should have used multiple lines inside a single `asm` statement, like `asm("foo\n\t"` `"bar\n\t"` `"baz" : stuff: :);`. Jumping outside a single `asm` statement requires `asm goto`. And if `result` is unused, only the `movl $1` part can be optimized away. The others are implicitly `asm volatile` because they have no outputs. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile. – Peter Cordes Sep 19 '17 at 07:06
  • But since the `movl` part is *not* `volatile`, the compiler could hoist it out of a loop and assume that the other statements have nothing to do with the value of `result`. Because that's what you told it with your constraints. The `add / jnp` might execute every time in a loop, but you didn't tell the compiler that has anything to do with `result`. – Peter Cordes Sep 19 '17 at 07:09
  • @CodyGray: simplest way to avoid `test`: write the last `asm` statement to take two inputs: `x>>8` and `x`, and do the last `xor` inside the asm statement! (And BTW, `setcc` is almost always faster than `cmovcc`, because `cmov` needs two constants set up, and is 2 uops on Intel pre-Broadwell. `setcc` has annoying partial-register behaviour, but xor-zeroing makes it mostly fine, and it's a single uop on everything that's still relevant. Even a `setcc %al` / `movzx %al, %eax` is probably better than cmov.) Of course, with gcc6 you can produce a flag result from inline asm, and let gcc setp. – Peter Cordes Sep 19 '17 at 07:14
-1

How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?

This is an XY problem... You think you need to inline that assembly to gain from its benefits, so you asked about how to inline it... but you don't need to inline it.

You shouldn't include assembly in your C source code, because in this case you don't need to, and the better alternative (in terms of portability and maintainability) is to keep the two pieces of source code separate, compile them separately, and use the linker to link them.

In parity64.c you should have your portable version (with a wrapper named bool CheckParity(size_t result)), which you can default to in non-x86/64 situations.
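
A minimal sketch of what that file might contain, reusing the portable implementation from the question as the default:

/* parity64.c -- portable fallback */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool CheckParity(size_t Result)
{
    uint64_t n = Result;
    n ^= n >> 1;
    n ^= n >> 2;
    n = (n & 0x1111111111111111ULL) * 0x1111111111111111ULL;
    return (n >> 60) & 1;
}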

You can compile this to an object file like so: gcc -c parity64.c -o parity64.o

... and then link the generated object code with the C code: gcc bindot.c parity64.o -o bindot


In parity64_x86.s you might have the following assembly code from your question:

.code

; bool CheckParity(size_t Result)
    CheckParity PROC
    mov     rax, 0
    add     rcx, 0
    jnp     jmp_over
    mov     rax, 1
jmp_over:
    ret
CheckParity ENDP

END

You can compile this to an alternative parity64.o object file using gcc with this command: gcc -c parity64_x86.s -o parity64.o (note that the listing above is MASM syntax; to assemble it with the GNU toolchain you would first need to translate it to GAS/AT&T syntax or use a MASM-compatible assembler).

... and then link the generated object code like so: gcc bindot.c parity64.o -o bindot


Similarly, if you wanted to use __builtin_parityl instead (as suggested by hdante's answer), you could (and should) once again keep that code separate (in the same place you keep other gcc/x86 optimisations) from your portable code. In parity64_x86.c you might have:

#include <stdbool.h>
#include <stddef.h>

bool CheckParity(size_t result) {
    return __builtin_parityl(result);
}

To compile this, your command would be: gcc -c parity64_x86.c -o parity64.o

... and then link the generated object code like so: gcc bindot.c parity64.o -o bindot

On a side-note, if you'd like to inspect the assembly gcc would produce from this: gcc -S parity64_x86.c


Comments in your assembly indicate that the equivalent function prototype in C would be bool CheckParity(size_t Result), so with that in mind, here's what bindot.c might look like:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

extern bool CheckParity(size_t Result);

uint64_t bindot(uint64_t *a, uint64_t *b, size_t entries){
    uint64_t parity = 0;

    for(size_t i = 0; i < entries; ++i)
        parity ^= a[i] & b[i];  // Running sum!

    return CheckParity(parity);
}

You can build this and link it to any of the above parity64.o versions like so: gcc bindot.c parity64.o -o bindot...

I highly recommend reading the manual for your compiler, when you have the time...

autistic
  • 2
    Wouldn't writing in a separate module and linking it result in loss of inlining opportunities? – Ajay Brahmakshatriya May 10 '17 at 11:15
  • @AjayBrahmakshatriya I assume you're referring to the link-time optimisation, not to assembly inlining (the subject of this question) or use of the `inline` keyword (which provides no useful guarantees); you should try to be more explicit in the future. Did you read the manual? Alternatively, [here's a direct answer to your question](http://stackoverflow.com/questions/5987020/can-the-linker-inline-functions)... – autistic May 10 '17 at 22:26
  • No I am not referring to Link time optimisation. I am talking about automatic inlining done by the compiler. If the compiler doesn't see the function in the translation unit, it won't be able to inline. Remember that inlining can happen even without writing `inline`. If you add this function to the header as static inline, with inline assembly, it can be inlined, whereas if you put it into a separate asm file and link it, it won't be easy to inline. And yes, I do read the manual. – Ajay Brahmakshatriya May 11 '17 at 02:24
  • @AjayBrahmakshatriya *"I am not referring to Link time optimisation. I am talking about automatic inlining done by the compiler."* Then you're not considering the whole picture. My answer involves *linking*. *"... it won't be easy to inline"* is fallacious; all of the common, up-to-date linkers perform more complex optimisations such as dead code elimination and profile-guided optimisation. I hate to repeat myself, but for the second time, please see the following question: [Can *the linker* inline functions?](http://stackoverflow.com/questions/5987020/can-the-linker-inline-functions) – autistic May 13 '17 at 03:02
  • @AjayBrahmakshatriya : You're correct; unless your code is always built with LTO, putting tiny but performance-sensitive functions in separate files is a terrible idea. This answer 100% depends on link-time optimization to not suck. (You really should use LTO for production builds anyway, but many people won't if it's an open-source project. LTO makes it possible to move more stuff out of headers so changing one thing means less recompiling for debug builds that don't use LTO.) – Peter Cordes Jun 21 '19 at 05:07