As a previous post and the wiki both say, Ivy Bridge can do "8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication". I'm a bit confused here. I know Ivy Bridge doesn't have FMA, and the AVX instruction set can do 4 DP operations per cycle, so why 4 additions + 4 multiplications?

Donald Duck
  • Please format your question here. This is not a WhatsApp message; it's a proper and formal Q&A site. – BusyProgrammer Feb 17 '17 at 02:19
  • Because while addition and multiplication each have a throughput of 1 per cycle on Ivy Bridge, it can do *both* of them in the same cycle even though they're not linked together in an FMA. – harold Feb 17 '17 at 14:00
  • @harold Thanks for the reply! I'm trying to understand a little more here. In Intel's manual http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf on pages 653 and 654, I see that the throughput for the add and mul instructions on Ivy Bridge (the 06_3A/3E column) is 1. Am I looking at the right thing? – Junjie Li Feb 18 '17 at 04:46
  • @JunjieLi It's a bit hard to tell which page that is, but if I got it right, that's a page about integer vector instructions. The AVX float operations are at the bottom of page C-7 and on page C-8. But that is not enough. I don't see anywhere that spells it out, but in the diagram on page "2-15" you can see that fpadd and fpmul go to different functional units through different ports. Based on just that, they could very likely execute in parallel, and it turns out that they actually can. – harold Feb 18 '17 at 12:09
  • @harold The diagram is very inspiring! Thanks a lot! – Junjie Li Feb 18 '17 at 16:08
  • Intel's tables that just show throughput are crap for this; check https://uops.info/ and https://agner.org/optimize/ to see whether two instructions compete for the same execution unit or not. [latency vs throughput in intel intrinsics](https://stackoverflow.com/q/40878534) – Peter Cordes Apr 08 '21 at 19:00
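
The arithmetic behind the "8 DP FLOPs/cycle" figure discussed in the comments above can be sketched as follows. This is only an illustrative model, not a measurement: the function `peak_dp_flops_per_cycle` and the tuple layout are made up for this example, and it assumes what the comments describe, namely one 4-wide AVX add and one 4-wide AVX multiply issued per cycle on separate ports.

```python
# Hypothetical back-of-the-envelope model, not taken from Intel's documentation.
# Each unit is described as (SIMD lanes, instructions per cycle, FLOPs per lane).
def peak_dp_flops_per_cycle(units):
    """Sum the per-cycle FLOPs of execution units that can run in parallel."""
    return sum(lanes * per_cycle * flops for lanes, per_cycle, flops in units)


# A 256-bit AVX register holds four 64-bit doubles.
AVX_DP_LANES = 256 // 64

# Assumption from the thread: the FP add and FP multiply units sit behind
# different ports, so one instruction of each kind can start every cycle.
ivy_bridge_units = [
    (AVX_DP_LANES, 1, 1),  # FP add unit: one 4-wide add per cycle
    (AVX_DP_LANES, 1, 1),  # FP multiply unit: one 4-wide multiply per cycle
]

# Both units busy: 4 additions + 4 multiplications.
ivy_bridge_peak = peak_dp_flops_per_cycle(ivy_bridge_units)

# Only additions in the instruction stream: the multiply unit sits idle.
add_only_peak = peak_dp_flops_per_cycle(ivy_bridge_units[:1])

print(ivy_bridge_peak, add_only_peak)
```

Under this model, code that issues only additions (or only multiplications) tops out at 4 DP FLOPs/cycle; the figure of 8 is reached only when the instruction stream has a balanced mix of independent adds and multiplies.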

0 Answers