
This question follows on from my question here (on the advice of Mysticial):

C code loop performance


Continuing from that question, when I use packed instructions instead of scalar instructions, the code using intrinsics looks very similar:

for(int i=0; i<size; i+=16) {
    y1 = _mm_load_ps(&output[i]);
    …
    y4 = _mm_load_ps(&output[i+12]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ps1(weight[i+k+l]);

            x1 = _mm_load_ps(&input[i+k+l]);
            y1 = _mm_add_ps(y1,_mm_mul_ps(w,x1));
            …
            x4 = _mm_load_ps(&input[i+k+l+12]);
            y4 = _mm_add_ps(y4,_mm_mul_ps(w,x4));
        }
    }
    _mm_store_ps(&output[i],y1);
    …
    _mm_store_ps(&output[i+12],y4);
}

The measured performance of this kernel is about 5.6 FP operations per cycle, although I would expect it to be exactly 4x the performance of the scalar version, i.e. 4 × 1.6 = 6.4 FP ops per cycle.
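To put that in cycles (a rough sanity check, assuming the 4x-unrolled body above, i.e. 4 mulps and 4 addps per inner iteration):

    /* Rough sanity check, assuming the 4x-unrolled inner loop above:
     * 4 mulps + 4 addps per iteration, 4 lanes each = 32 FLOPs/iteration. */
    #include <stdio.h>

    int main(void) {
        const double flops_per_iter = 4 * 4 * 2;  /* 4 vectors * 4 lanes * (mul+add) */
        printf("cycles/iter at 6.4 FLOPs/cycle: %.2f\n", flops_per_iter / 6.4); /* 5.00  */
        printf("cycles/iter at 5.6 FLOPs/cycle: %.2f\n", flops_per_iter / 5.6); /* ~5.71 */
        return 0;
    }

So the measured 5.6 FLOPs/cycle corresponds to roughly 0.7 extra cycles per inner iteration compared to the ideal 5 cycles.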

Taking the move of the weight factor into account (thanks for pointing that out), the schedule looks like:

[schedule diagram]

It looks like the schedule doesn't change, although there is now an extra instruction: after the movss operation moves the scalar weight value into an XMM register, a shufps copies this scalar value across the entire vector. Even taking the bypass latency from the load domain to the floating-point domain into account, the weight vector seems to be ready in time for the mulps, so this shouldn't incur any extra latency.
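For reference, a minimal sketch of what `_mm_set_ps1()` typically lowers to with plain SSE (the compiler is free to pick a different sequence); it corresponds to the `movss` + `shufps` pair visible in the assembly below:

    #include <xmmintrin.h>

    /* Broadcast one float into all four lanes: a scalar load (movss) followed
     * by shufps $0x0, which copies lane 0 into every lane. */
    static inline __m128 broadcast_load(const float *p) {
        __m128 s = _mm_load_ss(p);          /* movss  */
        return _mm_shuffle_ps(s, s, 0x00);  /* shufps */
    }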

The movaps (aligned, packed move), addps and mulps instructions used in this kernel (checked against the assembly code) have the same latency and throughput as their scalar versions, so this shouldn't incur any extra latency either.

Does anybody have an idea where this extra cycle per 8 cycles is spent, assuming the maximum performance this kernel can reach is 6.4 FP ops per cycle and it is running at 5.6 FP ops per cycle?


By the way, here is what the actual assembly looks like:

…
Block x: 
  movapsx  (%rax,%rcx,4), %xmm0
  movapsx  0x10(%rax,%rcx,4), %xmm1
  movapsx  0x20(%rax,%rcx,4), %xmm2
  movapsx  0x30(%rax,%rcx,4), %xmm3
  movssl  (%rdx,%rcx,4), %xmm4
  inc %rcx
  shufps $0x0, %xmm4, %xmm4               {fill weight vector}
  cmp $0x32, %rcx 
  mulps %xmm4, %xmm0 
  mulps %xmm4, %xmm1
  mulps %xmm4, %xmm2 
  mulps %xmm3, %xmm4
  addps %xmm0, %xmm5 
  addps %xmm1, %xmm6 
  addps %xmm2, %xmm7 
  addps %xmm4, %xmm8 
  jl 0x401ad6 <Block x> 
…
  • So I guess the question now is: "Why does the `shufps` instruction add 1 cycle every 1.6 iterations?" That's a tough one... – Mysticial Apr 04 '12 at 08:19
  • I would expect it to have no overhead since the output of the `shufps` should be directly available to the `mulps` op, since both are in the FP domain – Ricky Apr 04 '12 at 08:24
  • Easy to find out. Make sure that the weight vector does not contain any denormalized values. Try the loop without the shuffle instruction. It will not produce any useful results, but maybe you'll find which instruction costs you additional cycles (I suspect the shuffle, of course). – Gunther Piez Apr 04 '12 at 08:47
  • @Mysticial: I see 0.75 cycles per loop iteration added. (Wasn't it my comment about using 5 cycles instead of 4 which led you to your answer there... :-)) – Gunther Piez Apr 04 '12 at 08:49
  • @drhirsch Of course [everyone is afraid of denormalized values](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x)... Another thing to try is to replace the weight vector with SIMD blocks of identical values. That'll let you do a normal load and not need to shuffle (see the sketch after these comments). – Mysticial Apr 04 '12 at 08:51
  • @DanLeakin It would be helpful if you posted the actual cycle counts as measured instead of the basically useless Flops/cycle value, instead of letting us deduce it. – Gunther Piez Apr 04 '12 at 08:52
  • @drhirsch Yeah, your comment did indeed tip me in the right direction. :) This one is harder though... Hard to suspect anything in particular. There's too much in a modern CPU. :P – Mysticial Apr 04 '12 at 08:53
  • @Mysticial Actually the answer you have given there was my very first thought. 5 loads - 5 cycles - easy to see the coincidence. But then I remembered that _my_ current SB is able to do 2 loads per cycle, ignoring the fact that the question was about a Nehalem, and so I decided this could be the answer :-) – Gunther Piez Apr 04 '12 at 08:56
  • @drhirsch Yeah, I also hesitated because I thought scalar loads could be multiple issue on Nehalem. Apparently I was wrong when I took a look at Agner's tables. Nehalem isn't able to split its 128-bit/cycle load bandwidth the way that SB can split its 256-bit/cycle into dual-issue SSE loads. – Mysticial Apr 04 '12 at 09:04
  • OK, I tried to remove the `shufps` by using a `load` instruction, but the performance didn't increase, which in my opinion means that the `shufps` isn't the bad guy here. Any other explanations? Maybe the packed `movaps` instructions have some extra latency from cache effects (misses, misalignment) that isn't there with the `movss` instructions in the scalar version? – Ricky Apr 04 '12 at 09:17
  • For one, now you're demanding 4x the cache bandwidth. How large are the data sizes? Do they fit into the L1 cache? – Mysticial Apr 04 '12 at 09:27
  • Yes the data fits in the L1 cache – Ricky Apr 04 '12 at 10:59
  • @DanLeakin Could you move the load _out_ of the loop and just remove the shufps completely? So that you have basically the same code, but every scalar instruction is replaced by a vector instruction? – Gunther Piez Apr 04 '12 at 11:00
  • When moving the `load` out of the loop and thus removing the `shufps` instruction every iteration, the performance remains almost the same (it goes up a little because one load is gone), so I assume it is caused by the cache – Ricky Apr 04 '12 at 11:50
  • This is not exactly an answer to your question, but can't you use dpps? – Nathan Binkert Apr 06 '12 at 17:53
  • Are you using FTZ (flush-to-zero) and DAZ mode? – std''OrgnlDave Apr 08 '12 at 19:19
  • I don't use FTZ or DAZ. @Necrolis thanks for the link, I'll check into that – Ricky Apr 09 '12 at 12:30
  • If possible, I would use Intel Inspector (or its predecessor - VTune Performance Analyzer) to see where exactly performance is stalled. – Violet Giraffe Apr 09 '12 at 14:06
  • I already analyzed the code using VTune, but in my opinion this doesn't give much insight into the performance bottleneck at the cycle level – Ricky Apr 10 '12 at 07:28
  • Do you have any sample data we can run to test it out ourselves? (Or a simple way of generating similar data.) – GManNickG Apr 17 '12 at 02:43
  • Of course, just precede the for loop with a loop initializing some values like `for(i=0;i<2*size;i++) { input[i] = i/3; output[i] = i/5; weight[i] = i/8; }` and keep the `ksize` in the loop low (mine is 6) – Ricky Apr 17 '12 at 13:08
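Regarding the suggestion in the comments to replace the per-iteration broadcast with a plain load, here is a minimal sketch (the array name `weight4` and the helper function are hypothetical): each scalar weight is pre-broadcast once into a 16-byte-aligned, 4-wide block outside the hot loop, so the kernel can use `_mm_load_ps` instead of `movss` + `shufps`:

    #include <xmmintrin.h>

    /* Pre-broadcast each scalar weight into a 4-wide block (done once, outside
     * the hot loop). weight4 must be 16-byte aligned and hold 4*n floats. */
    void prebroadcast_weights(const float *weight, float *weight4, int n) {
        for (int j = 0; j < n; j++) {
            _mm_store_ps(&weight4[4 * j], _mm_set_ps1(weight[j]));
        }
    }

    /* In the kernel the broadcast then becomes a single aligned load, e.g.:
     *     w = _mm_load_ps(&weight4[4 * (i + k + l)]);
     */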

1 Answer


Try using EMON profiling in VTune, or some equivalent tool like oprofile.

EMON (Event Monitoring) profiling => like a time-based profile, but it can tell you which performance event is causing the problem. You should start out with a time-based profile first, though, to see if there is a particular instruction that jumps out (and possibly look at the related events that tell you how often there was a retirement stall at that IP).

To use EMON profiling, you must run through a list of events, ranging from "the usual suspects" to ...

Here I would start off with cache misses and alignment. I do not know if the processor you are using has a counter for RF port limitations - it should - but I added EMON profiling long ago, and I don't know how well they have kept up with adding events appropriate to the microarchitecture.

It may also be possible that it is a front-end (instruction fetch) stall. How many bytes are in these instructions, anyway? There are EMON events for that, too.


Responding to the comment that VTune on Nehalem can't see L3 events: not true. Here is what I was adding to the comment, but it did not fit:

Actually, there ARE performance counters for the LL3 / L3$ / so-called Uncore. I would be immensely surprised if VTune doesn't support them. See http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf, which points to VTune and other tools such as PTU. In fact, even without LL3 events, as David Levinthal says: "the Intel® Core™ i7 processor has a “latency event” which is very similar to the Itanium® Processor Family Data EAR event. This event samples loads, recording the number of cycles between the execution of the instruction and actual delivery of the data. If the measured latency is larger than the minimum latency programmed into MSR 0x3f6, bits 15:0, then the counter is incremented. Counter overflow arms the PEBS mechanism and on the next event satisfying the latency threshold, the measured latency, the virtual or linear address and the data source are copied into 3 additional registers in the PEBS buffer. Because the virtual address is captured into a known location, the sampling driver could also execute a virtual to physical translation and capture the physical address. The physical address identifies the NUMA home location and in principle allows an analysis of the details of the cache occupancies." He also points, on page 35, to VTune events such as L3 CACHE_HIT_UNCORE_HIT and L3 CACHE_MISS_REMOTE_DRAM. Sometimes you need to look up the numeric codes and program them into VTune's lower-level interface, but I think in this case it is visible in the pretty user interface.


OK, in http://software.intel.com/en-us/forums/showthread.php?t=77700&o=d&s=lr a VTune programmer in Russia (I think) "explains" that you can't sample on Uncore events.

He's wrong - you could, for example, enable only one CPU and sample meaningfully. I also believe there is the ability to mark data that missed the L3 as it returns to the CPU. In fact, the L3 knows which CPU it is returning data to, so you can definitely sample. You may not know which hyperthread it was, but again you can disable hyperthreading and go into single-thread mode.

But it looks like, as is rather common, you would have to work AROUND VTune, not with it, to do this.

Try latency profiling first. That's entirely inside the CPU, and the VTune folks are unlikely to have messed it up too much.

And, I say again, the likelihood is that your problem is in the core, not in the L3. So VTune should be able to handle that.


Try "Cycle Accounting" per Levinthal.

  • Thanks for your reaction. I use VTune to analyze my application, but the problem with the Nehalem architecture is that the L3 cache belongs to the `off-core` (uncore) part of the chip, so there are no performance event counters available for this part. Therefore it is hard to estimate cache misses, etc. – Ricky Apr 23 '12 at 12:44
  • Actually, there ARE performance counters for the LL3 / L3$ / so-called Uncore. I would be immensely surprised if VTune doesn't support them. See http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf – Krazy Glew Apr 23 '12 at 16:40
  • I wrote more than would fit in a comment, tried to move it to the answer and clean up the original comment, but comments can only be edited for 5 minutes. Short version: VTune allows you to see L3 cache misses, even without Uncore support, by using latency profiling - and it has Uncore support anyway. – Krazy Glew Apr 23 '12 at 16:51
  • And overall I suspect that your problem is not L3 cache misses. More likely a front end event. – Krazy Glew Apr 23 '12 at 16:52
  • @KrazyGlew: Your guess is right, he is a Russian guy from the Russian Federation. Here is his profile on LinkedIn - http://www.linkedin.com/in/vtsymbal –  Apr 23 '12 at 17:24
  • @Vlad_Lazarenko: By the way, I certainly do not mean to diss Vlad Tsymbal. In general, Intel's Russian teams were great to work with. I did, however, let my decades-spanning frustration with VTune show. A good performance analyst always thinks about disabling stuff in order to measure things like L3 cache misses, if that's what it takes. VTune is supposed to encapsulate the knowledge of a good performance analyst. – Krazy Glew Apr 24 '12 at 14:06
  • // As for the hardware not allowing attribution of LLC misses to CPU - that's silly. Either VTune missed something or the HW missed obvious fixes: (a) there should be marking, and (b) the information is there in the hardware, since the cache miss must be routed back to the correct requesting CPU. – Krazy Glew Apr 24 '12 at 14:08
  • Indeed, Levinthal says no attribution. Unfortunate. But this is the way HW/SW codevelopment works: if the VTune guys provide software to do attribution by disabling cores and threads, then perhaps there will be justification for HW to do a better job next time. // BTW http://www.intel.com/Assets/en_US/PDF/designguide/323535.pdf says that you can do MEM_LOAD_RETIRED.LLC_MISS PEBS profiling, so there is yet another way to measure LLC misses. – Krazy Glew Apr 24 '12 at 14:14
  • ... and use it in a profile. Better yet, with PEBS you know where the (non-speculative) miss actually occurred. – Krazy Glew Apr 24 '12 at 15:10