
According to the source of the Wikipedia page on the Knights Landing chip, it has Airmont cores. According to this page, those cores support SSE4.2 instructions, that is, SIMD instructions on SIMD registers. Is that really the case? If so, what's the actual maximum width of, say, arithmetic instructions on these Airmont cores (in terms of the total width of the register, or the width of a lane or element within the register times the number of lanes)?

einpoklum
  • Each core has two vector units which, as well as 512-bit AVX-512, also support all SSE variants (at 128 bits of course) and likewise AVX/AVX2 (at 256 bits). The 512-bit ZMM registers can be used as 256-bit AVX (YMM) registers or 128-bit SSE (XMM) registers. If you want to do anything with 8- or 16-bit vector elements though, you are limited to SSE/AVX2, since AVX512BW support is lacking. – Paul R Mar 16 '17 at 22:29
  • @PaulR: Just to be clear - I can issue 72 cores x 16 lanes = operations on 1152 32-bit values at once? – einpoklum Mar 16 '17 at 22:44
  • Yes, that's correct, although note that these cores are much simpler than current generation CPUs, so don't expect a similar instruction throughput as you might find on, say, a Haswell CPU. – Paul R Mar 16 '17 at 22:47
  • @PaulR: Suppose all I do is issue SIMD addition instructions. Why would I not expect a similar instruction throughput? Also, please make your comment an answer so I can accept it. – einpoklum Mar 16 '17 at 23:01
  • Well, I've only benchmarked "real world" workloads on KNL, so I can't give you an answer for a hypothetical case such as a stream of add instructions. It might be worth checking Agner Fog's site to see if he has info on instruction throughput for KNL AVX512 instructions which you can compare with the equivalent SSE or AVX instructions on a normal CPU. I'll convert comments to an answer tomorrow if I can - currently it's late here and I'm on a mobile device. – Paul R Mar 16 '17 at 23:08
  • You can do 72 cores * 16 lanes * 2 operations = 2304 32-bit values at once. In fact, it's better to count that in FLOPS because each FMA does 2 floating point operations. So it's 72*16*2*2*frequency = 4608 FLOP/cycle * frequency. – Z boson Mar 17 '17 at 08:41
  • Also, KNL is based on Airmont cores; that's not the same as saying it has Airmont cores. – Z boson Mar 17 '17 at 08:44
  • On my work's 68-core KNL the frequency at turbo is 1.5 GHz. So the peak is 68*16*2*2*1.5 = 6528 SP GFLOPS. On some KNL (I think with fewer than 68 cores) AVX operations scale the frequency down to the P1 state (from the P0 state), so the frequency is something like 1.1 GHz. In fact, I am not 100% convinced my KNL does not do this either, but some indirect tests and also the literature seem to indicate that it does not. – Z boson Mar 17 '17 at 08:49
  • @Zboson: I don't like to bring frequency into the mix because that's kind of orthogonal. I mean, obviously simpler processors can scale more, but that's not something that will affect how I would partition work among the cores. – einpoklum Mar 17 '17 at 09:26
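
The peak-throughput arithmetic from the comments above can be checked with a short Python sketch. The function names and default parameters are illustrative assumptions; the figures (16 32-bit lanes per ZMM register, two vector units per core, two FLOPs per FMA, 1.5 GHz turbo on a 68-core part) are the ones quoted in the comments, not independent measurements.

```python
# Peak single-precision throughput, using the figures quoted in the comments.
# Function names and defaults are illustrative, not part of any library.

def peak_flop_per_cycle(cores, lanes=16, vector_units=2, flop_per_fma=2):
    """Theoretical FLOP per clock cycle across the whole chip.

    lanes=16 assumes 32-bit elements in a 512-bit ZMM register.
    """
    return cores * lanes * vector_units * flop_per_fma


def peak_gflops(cores, ghz):
    """Theoretical peak in GFLOPS at a given clock frequency in GHz."""
    return peak_flop_per_cycle(cores) * ghz


print(72 * 16)                  # 1152 32-bit lanes, one vector unit per core
print(72 * 16 * 2)              # 2304 with both vector units
print(peak_flop_per_cycle(72))  # 4608 FLOP/cycle
print(peak_gflops(68, 1.5))     # 6528.0 SP GFLOPS
```

These are upper bounds only; as noted in the comments, real instruction throughput on these cores may be well below what the lane count suggests.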

1 Answer


Each core has two vector units which, as well as 512 bit AVX-512, also support all SSE variants (at 128 bits of course), and likewise AVX/AVX2 (at 256 bits).

The 512 bit ZMM registers can be used as 256 bit AVX (YMM) registers or 128 bit SSE (XMM) registers. If you want to do anything with 8 or 16 bit vector elements though you are limited to SSE/AVX2, since AVX-512BW support is lacking.
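
As a rough illustration of what this means for lane counts, the Python sketch below tabulates the widest usable register per element size. The helper names are hypothetical, and the rule that 8- and 16-bit elements top out at 256-bit AVX2 is simply the AVX-512BW limitation described here, encoded as an assumption.

```python
# Register widths in bits for the three aliased register views.
REGISTER_BITS = {"XMM": 128, "YMM": 256, "ZMM": 512}


def lanes(register, element_bits):
    """Number of elements of the given width that fit in one register."""
    return REGISTER_BITS[register] // element_bits


def widest_register_on_knl(element_bits):
    """Widest register usable for this element width on KNL.

    Assumption: without AVX-512BW, 8- and 16-bit elements are limited
    to SSE/AVX2, i.e. at most a 256-bit YMM register.
    """
    return "YMM" if element_bits in (8, 16) else "ZMM"


for bits in (8, 16, 32, 64):
    reg = widest_register_on_knl(bits)
    print(f"{bits:2d}-bit elements: {reg}, {lanes(reg, bits)} lanes")
```

So 32-bit elements get 16 lanes in a ZMM register, while 16-bit elements also get only 16 lanes because they are confined to YMM.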

Paul R
  • It might be worth pointing out that KNL does not have AVX512VL, so although it can do SSE and AVX operations, the mask operations you can use with AVX512 cannot be used for 256-bit and 128-bit operations. – Z boson Mar 17 '17 at 08:42