
MSVC and ICC both support the intrinsics _addcarry_u64 and _addcarryx_u64.

According to Intel's Intrinsic Guide and white paper these should map to adcx and adox respectively. However, by looking at the generated assembly it's clear they map to adc and adcx respectively and there is no intrinsic which maps to adox.

Additionally, telling the compiler to enable AVX2 with /arch:AVX2 in MSVC or -march=core-avx2 with ICC on Linux makes no difference. I'm not sure how to enable ADX with MSVC and ICC.

The documentation for MSVC lists _addcarryx_u64 with the technology of ADX whereas _addcarry_u64 has no listed technology. However, the link in MSVC's documentation for these intrinsics goes directly to the Intel Intrinsic guide which contradicts MSVC's own documentation and the generated assembly.

From this I conclude that Intel's Intrinsic guide and white paper are wrong.

This makes some sense for MSVC: since it does not allow inline assembly, it should provide a way to use adc, which it does with _addcarry_u64.
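As a rough illustration of using `_addcarry_u64` in place of inline `adc`, here is a minimal sketch of a multi-limb addition. The names `add_limb` and `mp_add` are hypothetical, the choice of `<immintrin.h>` as the header is an assumption (MSVC also declares the intrinsic in `<intrin.h>`), and the non-x86 branch is only a portable stand-in so the sketch compiles elsewhere. Which instruction the compiler actually emits still has to be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>          /* assumed header for _addcarry_u64 */
#define HAVE_ADDCARRY_U64 1
#endif

/* Hypothetical helper: *out = a + b + carry_in, returns the carry out. */
static unsigned char add_limb(unsigned char carry_in, uint64_t a, uint64_t b,
                              uint64_t *out)
{
#ifdef HAVE_ADDCARRY_U64
    unsigned long long sum;     /* matches the intrinsic's pointer type */
    unsigned char carry_out = _addcarry_u64(carry_in, a, b, &sum);
    *out = sum;
    return carry_out;
#else
    /* Portable stand-in for targets without the intrinsic. */
    uint64_t partial = a + b;
    unsigned char carry_out = partial < a;
    uint64_t sum = partial + carry_in;
    carry_out |= sum < partial;
    *out = sum;
    return carry_out;
#endif
}

/* Hypothetical helper: r = a + b over n 64-bit limbs (least significant
   limb first); returns the final carry. A single carry chain like this
   is the case an adc sequence covers. */
unsigned char mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(r && a && b);
    unsigned char carry = 0;
    for (size_t i = 0; i < n; i++)
        carry = add_limb(carry, a[i], b[i], &r[i]);
    return carry;
}
```

Calling `mp_add(r, a, b, 4)` on two 256-bit values stored as four limbs each returns the carry out of the top limb.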

One of the big advantages of adcx and adox is that they operate on different flags (carry CF and overflow OF) and this allows two independent parallel carry chains. However, since there is no intrinsic for adox how is this possible? With ICC at least one can use inline assembly but this is not possible with MSVC in 64-bit mode.


Microsoft and Intel's documentation (both the white paper and the intrinsic guide online) both agree now.

The documentation for the _addcarry_u64 intrinsic says it produces only adc. The _addcarryx_u64 intrinsic can produce either adcx or adox. With MSVC 2013 and 2015, however, _addcarryx_u64 only produces adcx. ICC produces both.

Z boson
  • ADOX and ADCX were introduced with Broadwell, so how can you make it work with AVX2? – phuclv Mar 25 '15 at 07:57
  • @LưuVĩnhPhúc, very good point. How do you enable `ADX` with MSVC and ICC? In any case the compiler still generated adcx with _addcarryx_u64 – Z boson Mar 25 '15 at 08:10
  • @LưuVĩnhPhúc, ICC 13.0.1 (which is what I use from https://gcc.godbolt.org/) came out late 2012 so it probably does not support `ADX`. ICC is up to 15.0.2 now. I think if ADX is not enabled then both MSVC and ICC use `adc` with `_addcarry_u64`. That would explain what I observe. The compilation should fail for `_addcarryx_u64` if `ADX` is not enabled though (and `_addcarryx_u64` is supposed to produce `adox` and not `adcx`) in my opinion. – Z boson Mar 25 '15 at 09:16
  • @LưuVĩnhPhúc, actually, the fact that both MSVC and ICC generate `adcx` says that the compilers I'm using do support `ADX`. I should not have to enable anything when using intrinsics (e.g. you don't have to enable SSE4.2 in MSVC to use an SSE4.2 intrinsic). It would be strange (I mean I have not seen it yet) if an intrinsic changed its behavior based on what instruction set you told the compiler was available. – Z boson Mar 26 '15 at 09:18
  • I just got myself a Skylake laptop and this was one of the things I started playing with. On VS2015 and ICC15, `_addcarry_u64()` only ever produces `adc`. On VS2015, `_addcarryx_u64()` only produces `adcx`, never `adox`. On ICC15, `_addcarryx_u64()` can produce either `adcx` or `adox` at the compiler's discretion. The ICC15 actually tries to be smart about it and will generate `adox` when it detects that there are multiple carry chains. That said, it's not optimal, and I was able to beat ICC15 hands down on my very first try with inline assembly. – Mysticial Nov 14 '15 at 05:09
  • @Mysticial, that's exactly what I found with MSVC2013. That's how I was able to use inline `adc` with MSVC for this [answer](http://stackoverflow.com/questions/9145644/visual-c-x64-add-with-carry/29229641#29229641). I think I found the same thing with ICC13 at http://gcc.godbolt.org/. Maybe Intel "fixed" it in ICC15. [They already admitted they had their documentation wrong for mulx](http://stackoverflow.com/questions/29206981/intrinsic-for-the-mulx-instruction). – Z boson Nov 14 '15 at 20:53
  • The mulx/adcx/adox seems to have the same problems as the double-load/store. Each of mulx/adcx/adox has throughput of 1 on Skylake. But when I put the 3 together (which is what they're meant to do), I'm having trouble getting under 1.5 cycles per mulx/adcx/adox triplet when it should be possible to do it 1 cycle throughput. For that matter, I'm having trouble getting over 2 instructions/cycle for anything that has a lot of add-with-carries. – Mysticial Nov 14 '15 at 21:16
  • I don't have hardware to test this yet (I plan on buying a Skylake laptop soon). I mostly looked at assembly to learn about the instruction set. Broadwell has these instructions as well. I don't think Skylake added any new instructions except re-enabling MPX. – Z boson Nov 14 '15 at 21:32
  • @Mysticial, Intel's documentation agrees with the assembly now. So I was right. Intel's documentation was wrong. It's better to trust the assembly than the documentation. It seems to me Intel should have three intrinsics for `adc`, `adcx`, and `adox` rather than trying to be smart and having one intrinsic for both `adcx` and `adox`. – Z boson Nov 18 '15 at 11:48
  • @Zboson I guess they wanted to give the compiler the flexibility to choose. After playing around with these over the weekend, I found that unlike the carry flag, it's very difficult (high overhead) to save and restore the overflow bit. For completely inlined operations, that isn't a problem. They're meant for RSA which is a fixed 512-bit or whatever. So you just write it all out straight-line. A massive block of nothing but mulx, adcx, adox, and stores to memory. – Mysticial Nov 18 '15 at 15:52
  • But when you're trying to implement a variable length multiply where the dimensions of the operations are unknown until run-time, you need loops. And loops have counters which will clobber the flag register. Which means you need to save and restore both flags. And that sucks. – Mysticial Nov 18 '15 at 15:59

2 Answers


Related, GCC does not support ADOX and ADCX at the moment. "At the moment" includes GCC 6.4 (Fedora 25) and GCC 7.1 (Fedora 26). GCC effectively disabled the intrinsics, but it still advertises support by defining __ADX__ in the preprocessor. Also see Issue 67317, Silly code generation for _addcarry_u32/_addcarry_u64. Many thanks to Xi Ruoyao for finding the issue.

According to Uros Bizjak on the GCC Help mailing list, GCC may never support the intrinsics. Also see GCC does not generate ADCX or ADOX for _addcarryx_u64.

Clang has its own set of issues with respect to ADOX and ADCX. Clang 3.9 and 4.0 crash when attempting to use them. Also see Issue 34249, Panic when using _addcarryx_u64 with Clang 3.9. According to Craig Topper, it should be fixed in Clang 5.0.

My apologies for posting the information under a MSVC question. This is one of the few hits when searching for information about using the intrinsics.

jww

They map to adc, adcx AND adox. The compiler decides which instructions to use, based on how you use them. If you perform two big-int additions in parallel the compiler will use adcx and adox, for higher throughput. For example:

unsigned char c1 = 0, c2 = 0;
for (int i = 0; i < 100; i++) {
    /* two independent carry chains: res += a (carry c1) and res += b (carry c2) */
    c1 = _addcarry_u64(c1, res[i], a[i], &res[i]);
    c2 = _addcarry_u64(c2, res[i], b[i], &res[i]);
}
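To make the two-chain pattern concrete, here is a minimal self-contained sketch that wraps the loop in a function. The names `adc64` and `accumulate_two` are hypothetical, `<immintrin.h>` is an assumed header choice, and the non-x86 branch is a portable stand-in so the sketch builds elsewhere. It only shows the arithmetic; whether a given compiler turns the two chains into `adcx`/`adox` rather than serialized `adc` is not guaranteed and should be checked in the assembly (comments on the question report that `_addcarry_u64` yields only `adc` with MSVC and ICC).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>          /* assumed header for _addcarry_u64 */
/* Hypothetical wrapper: *out = x + y + c, returns the carry out. */
static unsigned char adc64(unsigned char c, uint64_t x, uint64_t y, uint64_t *out)
{
    unsigned long long s;       /* matches the intrinsic's pointer type */
    c = _addcarry_u64(c, x, y, &s);
    *out = s;
    return c;
}
#else
/* Portable stand-in for targets without the intrinsic. */
static unsigned char adc64(unsigned char c, uint64_t x, uint64_t y, uint64_t *out)
{
    uint64_t t = x + y;
    unsigned char carry = t < x;
    uint64_t s = t + c;
    carry |= s < t;
    *out = s;
    return carry;
}
#endif

/* Hypothetical helper: res += a + b over n limbs (least significant limb
   first), using two separate carry chains c1 and c2. Returns c1 + c2,
   the amount (0..2) that overflowed past the top limb. */
unsigned accumulate_two(uint64_t *res, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(res && a && b);
    unsigned char c1 = 0, c2 = 0;
    for (size_t i = 0; i < n; i++) {
        c1 = adc64(c1, res[i], a[i], &res[i]);   /* chain 1: res += a */
        c2 = adc64(c2, res[i], b[i], &res[i]);   /* chain 2: res += b */
    }
    return (unsigned)c1 + c2;
}
```

The two carries never feed into each other, which is the property that lets one chain live in CF and the other in OF when `adcx` and `adox` are used.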
Z boson
Vlad Krasnov
  • I don't know how to enable `ADX` with ICC or MSVC. I'm not sure ICC 13 supports it. Maybe you can add the assembly output for an example to your answer? – Z boson Mar 25 '15 at 08:35
  • *"I don't know how to enable ADX with ICC..."* - For GCC, ICC and possibly Clang, use either `-march=native` (if it's a native CPU feature), or add the `-madx` feature flag. Related, see [Test case for adcx and adox](http://stackoverflow.com/q/39320944). It's not generating `adcx`/`adox` under GCC 6.1 at `-O3`. – jww Sep 04 '16 at 20:09
  • *"The compiler decides which instructions to use, based on how you use them..."* - Do you think a 4-way add-with-carry will get the compiler to generate something other than a serialized `adc`? – jww Sep 04 '16 at 20:31
  • @jww: This looks like a perfect use-case for adcx/adox. But I think compilers aren't smart enough to take advantage yet. For now they support the intrinsic to make correct code. Maybe in a year or so they will make code that's fast as well as correct, with no source changes. Note that clang3.8.1 does use adcx here, but with some really bad flag-saving (sahf) and push/pop of eax (and I think a bug in the asm source output, since `mov eax, ch` won't assemble). The [disassembly of the machine code](https://godbolt.org/g/fhHkJu) shows `mov eax,ebp` there. – Peter Cordes Sep 04 '16 at 21:25
  • @PeterCordes, my comments about enabling ADX with MSVC were a misunderstanding on my part. With GCC you enable these things with a flag (so that the compiler can optimize for them even with intrinsics). With MSVC, if you use the intrinsics it just assumes the instruction exists. GCC would tell you that the intrinsics do not exist if you don't enable ADX. I don't know what MSVC does when there are better ways to implement intrinsics. I think it just assumes the lowest common denominator. I know that for auto-vectorization MSVC in some cases checks for SSE4.1 and otherwise uses SSE2. – Z boson Sep 05 '16 at 12:44
  • @Zboson: that info about MSVC might be useful on jww's new question (linked in his first comment). – Peter Cordes Sep 05 '16 at 12:53