
I would like to know the peak FLOPs per cycle for the ARM1176JZF-S core in the Raspberry Pi 1 and the Cortex-A7 cores in the Raspberry Pi 2.


From the ARM1176JZF-S Technical Reference Manual it seems that VFPv2 can do one SP MAC every clock cycle and one DP MAC every other clock cycle. In addition, there are three pipelines which can operate in parallel: a MAC pipeline (FMAC), a divide/square-root pipeline (DS), and a load/store pipeline (LS). Based on this, it appears the ARM1176JZF-S of the Raspberry Pi 1 can do at least (from the FMAC pipeline)

  • 1 DP FLOP/cycle: one DP MAC every two cycles (2 FLOPs per 2 cycles)
  • 2 SP FLOPs/cycle: one SP MAC every cycle (2 FLOPs per cycle)

Wikipedia claims the Raspberry Pi 1 achieves 0.041 DP GFLOPS. Dividing by 0.700 GHz gives less than 0.06 DP FLOPs/cycle, which is about 17 times less than my estimate of 1 DP FLOP/cycle.
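
Spelling that comparison out as a minimal sketch (just the arithmetic from above, nothing measured):

    #include <stdio.h>

    int main(void)
    {
        double wiki_gflops = 0.041;  /* DP GFLOPS figure quoted on Wikipedia        */
        double clock_ghz   = 0.700;  /* Raspberry Pi 1 clock                        */
        double my_estimate = 1.0;    /* my DP FLOPs/cycle estimate (FMAC pipeline)  */

        double measured_per_cycle = wiki_gflops / clock_ghz;            /* ~0.059   */
        printf("measured: %.3f DP FLOPs/cycle\n", measured_per_cycle);
        printf("ratio:    %.1fx below my estimate\n", my_estimate / measured_per_cycle);
        return 0;
    }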

So what's the correct answer?


For the Cortex-A7 processor in the Raspberry Pi 2 I believe the peak is the same as for the Cortex-A9. The FLOPs/cycle/core for the Cortex-A9 is:

  • 1.5 DP FLOPs/cycle: one scalar addition every cycle plus one scalar multiplication every other cycle (1 + 0.5)
  • 4 SP FLOPs/cycle: one 4-wide NEON addition every other cycle plus one 4-wide NEON multiplication every other cycle (2 + 2)

Is the FLOPs/cycle/core for the Raspberry Pi 2 the same as for the Cortex-A9? If not, what is the correct answer?

Edit:

The main differences between the Cortex-A9 and the Cortex-A7 (when it comes to peak FLOPs/cycle) are:

  • the Cortex-A9 is dual-issue (two instructions per clock) while the Cortex-A7 is only partially dual-issue: "the A7 cannot dual-issue floating point or NEON instructions."
  • the Cortex-A9 is an out-of-order (OoO) processor and the Cortex-A7 is not.

I'm not sure why OoO would affect the peak FLOPS, but the lack of dual issue certainly should; I think it would cut the peak FLOPS in half.

Edit: based on the table at http://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/ which Stephen Canon gave in a comment, here are my new peak FLOPs estimates for the Cortex-A7 (see the sketch after this list):

  • 0.5 DP FLOPs/cycle: one VMLA.F64 (VFP) every four cycles.
  • 1.0 DP FLOPs/cycle: one VADD.F64 (VFP) every cycle.
  • 2.0 SP FLOPs/cycle: one VMLA.F32 (VFP) every cycle.
  • 2.0 SP FLOPs/cycle: one VMLA.F32 (NEON) on two 32-bit floats every other cycle.
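
To see whether the 2.0 SP FLOPs/cycle figure is reachable in practice, here is a minimal microbenchmark sketch (my own construction, nothing measured): it assumes GCC with -mcpu=cortex-a7 -mfpu=neon-vfpv4 so that fmaf can map to a fused MAC, and it uses several independent accumulator chains so the loop is limited by MAC throughput rather than result latency (more chains may be needed, depending on the actual latency).

    #include <math.h>

    /* Four independent multiply-accumulate chains; each fmaf is 2 FLOPs, so one
       iteration is 8 FLOPs.  All operands stay in registers, so the loop should be
       bound by the MAC pipeline rather than by loads.  Call with |x| < 1 to keep
       the values bounded. */
    float mac_probe(float x, long iters)
    {
        float s0 = 1.0f, s1 = 2.0f, s2 = 3.0f, s3 = 4.0f;
        long  i;
        for (i = 0; i < iters; i++) {
            s0 = fmaf(s0, x, 1.0f);
            s1 = fmaf(s1, x, 1.0f);
            s2 = fmaf(s2, x, 1.0f);
            s3 = fmaf(s3, x, 1.0f);
        }
        return s0 + s1 + s2 + s3;   /* keep the result live so the loop is not removed */
    }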
Z boson
  • I'm aware of Integer SIMD computation on the [VideoCore-IV](https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-Kernels-under-Linux). I'm not asking about that in this question; I'm only interested in the FLOPS of the ARM11 and Cortex-A7 cores. – Z boson Jun 22 '15 at 09:44
  • My bad, somehow I misread and saw the discrepancy the wrong way round. – Notlikethat Jun 22 '15 at 10:06
  • benchmarking is subjective, the only thing that matters is your favorite (or at least tolerable) compiler, with the code you plan to deploy and how fast that runs. Unless this is for marketing or advertising reasons, in which case just take the numbers from ARM's marketing folks and repeat them. – old_timer Jun 22 '15 at 13:28
  • @dwelch, I'm talking about the peak flops which can be calculated. A benchmark should never get 100% of the peak. I want to know the theoretical best that a benchmark can obtain. – Z boson Jun 23 '15 at 07:06
  • The 41 DP MFLOPS for 700 MHz RPi is probably based on the Linpack benchmark. My version obtains the same rating and 147 MFLOPS on 900 MHz RPi 2. My fastest SP MFLOPS test, with 32 multiply or add operations per data word read/written, achieves 192 MFLOPS on RPi, with RPi 2 at 410, then 709 via NEON (1581 4 cores). – Roy Longbottom Jun 23 '15 at 10:29
  • @RoyLongbottom, thanks for the numbers! Your numbers are much less than my peak estimates. Likely I don't understand the hardware well enough yet. Are you sure, though, that the benchmarks fully utilize the hardware (e.g. VFPv2 for RPi 1, NEON for RPi 2)? – Z boson Jun 23 '15 at 10:44
  • You are not the only one not to understand. The same calculations via Linux on a 3.9 GHz Core i7 produce 24.6 GFLOPS out of a possible 31.2 using SSE instructions and 31.2 GFLOPS out of 62.4 with AVX 1. I will provide disassembled code in an answer. – Roy Longbottom Jun 23 '15 at 21:27
  • The Cortex-A7 FPU is definitely not the same as Cortex-A9. I don't know of any public timing documentation from ARM, but a quick search does turn up this table of timing characteristics that someone compiled: http://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/ – Stephen Canon Jun 23 '15 at 22:40
  • @RoyLongbottom, on x86 I'm able to obtain over 95% of the peak for operations such as `sum += a*b` with SSE and AVX. But I'm totally new to ARM, so I don't know what the peak should be and know very little about the hardware. The technical reference manual I mentioned says these MACs can happen in one cycle for float (two for double) and the Cortex-A9 is pipelined, so why can't it sustain the peaks I claim? – Z boson Jun 24 '15 at 07:55
  • @StephenCanon, so I read up a little. The main difference between the A7 and A9 is that the A7 cannot dual-issue and is not OoO. The lack of dual-issue would cut the peak FLOPS in half, I think, but I don't see why OoO would matter to the peak. In any case half of the A9 peak is still a lot more than the benchmarks I have seen so far. – Z boson Jun 24 '15 at 09:04
  • @Zboson: There's more to it than that. The NEON ALU in Cortex-A9 isn't really four-wide; it's two-wide (meaning q-register operations simply take two execution slots, one for the low d-register, one for the high d-register). Based on the timings I linked, the Cortex-A7 FP ALU is only one-wide, so q-register FP operations take four execution slots. – Stephen Canon Jun 24 '15 at 14:44
  • This is called "double-pumping" or "quad-pumping", and it's a common technique to provide vector ISA compatibility on very limited low-power (or low-cost) parts. Intel did it on the original Core processors, for example. – Stephen Canon Jun 24 '15 at 14:46
  • @StephenCanon, so then my estimate would be off by a factor of 4, i.e. 1 SP FLOP/cycle. That's not so far off from RoyLongbottom's result of 709 MFLOPS with NEON. – Z boson Jun 24 '15 at 15:12
  • @Zboson: *If* the table is correct, peak should be 2 SP flop/cycle via `VMLA.f32`, but it's possible there's a dependency not reflected in that table. – Stephen Canon Jun 24 '15 at 15:22
  • @StephenCanon, okay, I got it. One VMLA.F32 on a double-word (two 32-bit floats) is 4 FLOPs and the inverse throughput is 2 cycles, so 2 SP FLOPs/cycle. – Z boson Jun 24 '15 at 15:32
  • @StephenCanon, I updated my question with new peaks based on the table you provided and your comments. My new estimates are about two times larger than Roy Longbottom's benchmarks. – Z boson Jun 24 '15 at 15:40
  • @StephenCanon, apparently the [Cortex-A7 has NEON FMA](http://stackoverflow.com/questions/15227278/arm-neon-simd-version-2) but the table you listed does not include it. I don't think it would make a difference to the peak FLOPs/cycle though, since it's likely at least as slow as VMLA.F32. – Z boson Jun 25 '15 at 08:07
  • @StephenCanon, sorry to belabour this, but I just noticed that VMLA.F32 using VFP takes one cycle whereas VMLA.F32 using NEON takes two cycles. This means they both do 2 FLOPs/cycle, i.e. NEON is no better than VFP (at least for floats; I have not checked ints). – Z boson Jun 26 '15 at 07:52
  • Makes sense, I would expect that to be the case on a machine with a scalar f32 unit that's multiply-pumped to support a vector ISA; the only real benefit of using the vector instructions will be code density (which may help with decode/issue limits if you encounter them). Also, VFP doesn't have integer instructions (except for converts to/from FP). – Stephen Canon Jun 26 '15 at 10:37
  • @StephenCanon, I think the point of the Cortex-A7 is that it is 100% binary instruction set compatible with the Cortex-A15 (that is not necessarily true of the Cortex-A8 and Cortex-A9). This is why they are used in big.LITTLE. The Cortex-A7 is not designed to be fast (that's the Cortex-A15's job), so NEON is there for compatibility. I understand now what you mean by quad-pumped: 4-wide NEON on the Cortex-A7 is the same as 4 scalar operations. – Z boson Jun 26 '15 at 10:47

1 Answer


Example 1 - Compiled code MP-MFLOPSPiNeon that obtains >647 MFLOPS (data words 3.2k to 3.2M) on a 900 MHz RPi 2. The disassembly appears to be the same without threading. The compile/link command used and the C code for 32 operations per data word are below [someone might suggest faster compile options].

   MP-MFLOPS Compiled NEON v1.0

   gcc mpmflops.c cpuidc.c -lrt -lc -lm -O3 -mcpu=cortex-a7
    -mfloat-abi=hard -mfpu=neon-vfpv4 -funsafe-math-optimizations -lpthread -o MP-MFLOPSPiNeon

   32 OPs/Word 1 CPU 692 MFLOPS

 void triadplus2(int n, float a, float b, float c, float d,
                 float e, float f, float g, float h, float j,
                 float k, float l, float m, float o, float p,
                 float q, float r, float s, float t, float u,
                 float v, float w, float y, float *x)
 {
     int i;
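     /* 32 FLOPs per element: 11 additions inside the parentheses, 11 multiplications,
        and 10 add/subtract operations combining the terms. */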
     for(i=0; i<n; i++)
     x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f-(x[i]+g)*h+(x[i]+j)*k
     -(x[i]+l)*m+(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t-(x[i]+u)*v+(x[i]+w)*y;
 }
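
The MFLOPS rating presumably counts 32 floating point operations per element against the elapsed time, i.e. MFLOPS = 32*n / (seconds*1e6). A minimal sketch of such a measurement follows; the timing wrapper and its names are my assumption, not the benchmark's actual harness.

    #include <time.h>

    /* Hypothetical timing wrapper (not the real harness). */
    double measure_mflops(int n, float *x)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        triadplus2(n, 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f, 1.0f,
                   1.1f, 1.2f, 1.3f, 1.4f, 1.5f, 1.6f, 1.7f, 1.8f, 1.9f, 2.0f,
                   2.1f, 2.2f, x);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return 32.0 * n / (secs * 1e6);   /* 32 floating point operations per element */
    }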

Following is the rather complex disassembly. Note the highlighted fused multiply accumulate or subtract instructions, along with an excessive number of loads.

 triadplus2:

    @ args = 24, pretend = 0, frame = 272
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    stmfd   sp!, {r4, r5, r6, r7}
    cmp     r0, #0
    fstmfdd sp!, {d8, d9, d10, d11, d12, d13, d14, d15}
    sub     sp, sp, #272
    flds    s21, [sp, #352]
    flds    s18, [sp, #356]
    flds    s19, [sp, #360]
    flds    s16, [sp, #364]
    flds    s20, [sp, #368]
    flds    s17, [sp, #372]
    ble     .L57
    sbfx    r3, r1, #2, #1
    and     r3, r3, #3
    cmp     r3, r0
    movcs   r3, r0
    cmp     r0, #4
    movls   r3, r0
    bhi     .L80
 LOOP HERE  
.L59:
    flds    s23, [r1]
    cmp     r3, #1
    fadds   s22, s23, s4
    movls   r2, #1
    fadds   s24, s23, s0
    fadds   s31, s23, s8
    fadds   s30, s23, s12
    fmuls   s22, s22, s5
    fadds   s29, s23, s21
    fadds   s28, s23, s20
    fadds   s27, s23, s6
    vfma.f32        s22, s24, s1
    fadds   s26, s23, s2
    fadds   s25, s23, s10
    fadds   s24, s23, s14
    fadds   s23, s23, s19
    vfma.f32        s22, s31, s9
    vfma.f32        s22, s30, s13
    vfma.f32        s22, s29, s18
    vfma.f32        s22, s28, s17
    vfms.f32        s22, s27, s7
    vfms.f32        s22, s26, s3
    vfms.f32        s22, s25, s11
    vfms.f32        s22, s24, s15
    vfms.f32        s22, s23, s16
    fsts    s22, [r1]
    bls     .L61
    flds    s23, [r1, #4]
    cmp     r3, #2
    fadds   s22, s23, s4
    movls   r2, #2
    fadds   s24, s23, s0
    fadds   s31, s23, s8
    fadds   s30, s23, s12
    fmuls   s22, s22, s5
    fadds   s29, s23, s21
    fadds   s28, s23, s20
    fadds   s27, s23, s6
    vfma.f32        s22, s24, s1
    fadds   s26, s23, s2
    fadds   s25, s23, s10
    fadds   s24, s23, s14
    fadds   s23, s23, s19
    vfma.f32        s22, s31, s9
    vfma.f32        s22, s30, s13
    vfma.f32        s22, s29, s18
    vfma.f32        s22, s28, s17
    vfms.f32        s22, s27, s7
    vfms.f32        s22, s26, s3
    vfms.f32        s22, s25, s11
    vfms.f32        s22, s24, s15
    vfms.f32        s22, s23, s16
    fsts    s22, [r1, #4]
    bls     .L61
    flds    s23, [r1, #8]
    cmp     r3, #3
    fadds   s22, s23, s4
    movls   r2, #3
    fadds   s24, s23, s0
    fadds   s31, s23, s8
    fadds   s30, s23, s12
    fmuls   s22, s22, s5
    fadds   s29, s23, s21
    fadds   s28, s23, s20
    fadds   s27, s23, s6
    vfma.f32        s22, s24, s1
    fadds   s26, s23, s2
    fadds   s25, s23, s10
    fadds   s24, s23, s14
    fadds   s23, s23, s19
    vfma.f32        s22, s31, s9
    vfma.f32        s22, s30, s13
    vfma.f32        s22, s29, s18
    vfma.f32        s22, s28, s17
    vfms.f32        s22, s27, s7
    vfms.f32        s22, s26, s3
    vfms.f32        s22, s25, s11
    vfms.f32        s22, s24, s15
    vfms.f32        s22, s23, s16
    fsts    s22, [r1, #8]
    bls     .L61
    flds    s23, [r1, #12]
    mov     r2, #4
    fadds   s22, s23, s20
    fadds   s24, s23, s21
    fadds   s31, s23, s12
    fadds   s30, s23, s8
    fmuls   s22, s22, s17
    fadds   s29, s23, s4
    fadds   s28, s23, s0
    fadds   s27, s23, s6
    vfma.f32        s22, s24, s18
    fadds   s26, s23, s2
    fadds   s25, s23, s10
    fadds   s24, s23, s14
    fadds   s23, s23, s19
    vfma.f32        s22, s31, s13
    vfma.f32        s22, s30, s9
    vfma.f32        s22, s29, s5
    vfma.f32        s22, s28, s1
    vfms.f32        s22, s27, s7
    vfms.f32        s22, s26, s3
    vfms.f32        s22, s25, s11
    vfms.f32        s22, s24, s15
    vfms.f32        s22, s23, s16
    fsts    s22, [r1, #12]
 .L61:
    cmp     r3, r0
    beq     .L57
    rsb     r6, r3, r0
    mov     r4, r6, lsr #2
    movs    r7, r4, asl #2
    beq     .L63
 .L81:
    vdup.32 q12, d1[1]
    vdup.32 q8, d0[0]
    vdup.32 q10, d0[1]
    vdup.32 q11, d1[0]
    vstr    d24, [sp, #64]
    vstr    d25, [sp, #72]
    vdup.32 q12, d3[1]
    vstr    d16, [sp, #16]
    vstr    d17, [sp, #24]
    vstr    d20, [sp, #32]
    vstr    d21, [sp, #40]
    vdup.32 q8, d2[0]
    vdup.32 q10, d2[1]
    vstr    d22, [sp, #48]
    vstr    d23, [sp, #56]
    vstr    d24, [sp, #128]
    vstr    d25, [sp, #136]
    vdup.32 q11, d3[0]
    vdup.32 q12, d5[1]
    vstr    d16, [sp, #80]
    vstr    d17, [sp, #88]
    vstr    d20, [sp, #96]
    vstr    d21, [sp, #104]
    vdup.32 q8, d4[0]
    vdup.32 q10, d4[1]
    vstr    d22, [sp, #112]
    vstr    d23, [sp, #120]
    vstr    d24, [sp, #192]
    vstr    d25, [sp, #200]
    vdup.32 q11, d5[0]
    vdup.32 q12, d10[0]
    vstr    d16, [sp, #144]
    vstr    d17, [sp, #152]
    vstr    d20, [sp, #160]
    vstr    d21, [sp, #168]
    vstr    d22, [sp, #176]
    vstr    d23, [sp, #184]
    vdup.32 q8, d6[0]
    vdup.32 q10, d9[1]
    vdup.32 q11, d8[0]
    vstr    d24, [sp, #256]
    vstr    d25, [sp, #264]
    vdup.32 q12, d8[1]
    vstr    d16, [sp, #208]
    vstr    d17, [sp, #216]
    vdup.32 q7, d6[1]
    vdup.32 q6, d7[0]
    vdup.32 q15, d7[1]
    vdup.32 q14, d10[1]
    vdup.32 q13, d9[0]
    vstr    d20, [sp, #224]
    vstr    d21, [sp, #232]
    vstr    d22, [sp, #240]
    vstr    d23, [sp, #248]
    vst1.64 {d24-d25}, [sp:64]
    add     r3, r1, r3, asl #2
    mov     ip, #0
    mov     r5, r3
 .L69:

    vfma FUSED MULTIPLY ACCUMULATE or vfms SUBTRACT QUAD WORDS

    vld1.64 {d18-d19}, [r3:64]!
    vldr    d20, [sp, #80]
    vldr    d21, [sp, #88]
    vldr    d22, [sp, #16]
    vldr    d23, [sp, #24]
    vadd.f32        q8, q9, q10
    vldr    d24, [sp, #96]
    vldr    d25, [sp, #104]
    vadd.f32        q10, q9, q11
    vmul.f32        q8, q8, q12
    vldr    d22, [sp, #32]
    vldr    d23, [sp, #40]
    vldr    d24, [sp, #144]
    vldr    d25, [sp, #152]
    vfma.f32        q8, q10, q11
    add     ip, ip, #1
    vadd.f32        q11, q9, q12
    vldr    d24, [sp, #208]
    vldr    d25, [sp, #216]
    cmp     r4, ip
    vadd.f32        q10, q9, q12
    vldr    d24, [sp, #160]
    vldr    d25, [sp, #168]
    vfma.f32        q8, q11, q12
    vadd.f32        q11, q9, q14
    vldr    d24, [sp, #256]
    vldr    d25, [sp, #264]
    vfma.f32        q8, q10, q7
    vadd.f32        q10, q9, q12
    vldr    d24, [sp, #112]
    vldr    d25, [sp, #120]
    vfma.f32        q8, q11, q13
    vadd.f32        q11, q9, q12
    vld1.64 {d24-d25}, [sp:64]
    vfma.f32        q8, q10, q12
    vldr    d24, [sp, #48]
    vldr    d25, [sp, #56]
    vadd.f32        q10, q9, q12
    vldr    d24, [sp, #128]
    vldr    d25, [sp, #136]
    vfms.f32        q8, q11, q12
    vldr    d24, [sp, #176]
    vldr    d25, [sp, #184]
    vadd.f32        q11, q9, q12
    vldr    d24, [sp, #64]
    vldr    d25, [sp, #72]
    vfms.f32        q8, q10, q12
    vldr    d24, [sp, #224]
    vldr    d25, [sp, #232]
    vadd.f32        q10, q9, q6
    vadd.f32        q9, q9, q12
    vldr    d24, [sp, #192]
    vldr    d25, [sp, #200]
    vfms.f32        q8, q11, q12
    vfms.f32        q8, q10, q15
    vldr    d20, [sp, #240]
    vldr    d21, [sp, #248]
    vfms.f32        q8, q9, q10
    vst1.64 {d16-d17}, [r5:64]!
    bhi     .L69

    END vfma FUSED MULTIPLY ACCUMULATE or vfms SUBTRACT QUAD WORDS

    cmp     r7, r6
    add     r2, r2, r7
    beq     .L57
 .L63:
    add     ip, r1, r2, asl #2
    add     r3, r2, #1
    cmp     r0, r3
    flds    s23, [ip]
    fadds   s22, s23, s4
    fadds   s24, s23, s0
    fadds   s31, s23, s8
    fadds   s30, s23, s12
    fmuls   s22, s22, s5
    fadds   s29, s23, s21
    fadds   s28, s23, s20
    fadds   s27, s23, s2
    vfma.f32        s22, s24, s1
    fadds   s26, s23, s6
    fadds   s25, s23, s14
    fadds   s24, s23, s10
    fadds   s23, s23, s19
    vfma.f32        s22, s31, s9
    vfma.f32        s22, s30, s13
    vfma.f32        s22, s29, s18
    vfma.f32        s22, s28, s17
    vfms.f32        s22, s27, s3
    vfms.f32        s22, s26, s7
    vfms.f32        s22, s25, s15
    vfms.f32        s22, s24, s11
    vfms.f32        s22, s23, s16
    fsts    s22, [ip]
    ble     .L57
    add     r3, r1, r3, asl #2
    add     r2, r2, #2
    cmp     r0, r2
    flds    s23, [r3]
    fadds   s22, s23, s4
    fadds   s24, s23, s0
    fadds   s31, s23, s8
    fadds   s30, s23, s12
    fmuls   s22, s22, s5
    fadds   s29, s23, s21
    fadds   s28, s23, s20
    fadds   s27, s23, s6
    vfma.f32        s22, s24, s1
    fadds   s26, s23, s2
    fadds   s25, s23, s10
    fadds   s24, s23, s14
    fadds   s23, s23, s19
    vfma.f32        s22, s31, s9
    vfma.f32        s22, s30, s13
    vfma.f32        s22, s29, s18
    vfma.f32        s22, s28, s17
    vfms.f32        s22, s27, s7
    vfms.f32        s22, s26, s3
    vfms.f32        s22, s25, s11
    vfms.f32        s22, s24, s15
    vfms.f32        s22, s23, s16
    fsts    s22, [r3]
    ble     .L57
    add     r2, r1, r2, asl #2
    flds    s22, [r2]
    fadds   s4, s22, s4
    fadds   s0, s22, s0
    fadds   s8, s22, s8
    fadds   s12, s22, s12
    fmuls   s5, s4, s5
    fadds   s21, s22, s21
    fadds   s20, s22, s20
    fadds   s6, s22, s6
    vfma.f32        s5, s0, s1
    fadds   s2, s22, s2
    fadds   s10, s22, s10
    fadds   s14, s22, s14
    fadds   s19, s22, s19
    vfma.f32        s5, s8, s9
    vfma.f32        s5, s12, s13
    vfma.f32        s5, s21, s18
    vfma.f32        s5, s20, s17
    vfms.f32        s5, s6, s7
    vfms.f32        s5, s2, s3
    vfms.f32        s5, s10, s11
    vfms.f32        s5, s14, s15
    vfms.f32        s5, s19, s16
    fsts    s5, [r2]
 .L57:
    add     sp, sp, #272
    @ sp needed
    fldmfdd sp!, {d8-d15}
    ldmfd   sp!, {r4, r5, r6, r7}
    bx      lr
 .L80:
    cmp     r3, #0
    moveq   r2, r3
    bne     .L59

    rsb     r6, r3, r0
    mov     r4, r6, lsr #2
    movs    r7, r4, asl #2
    bne     .L81
    b       .L63
    .size   triadplus2, .-triadplus2

Example 2 - Using NEON intrinsic functions (from before I knew of the fused instructions), >700 MFLOPS. First the C code:

 32 Operations per word
 C NEON Intrinsics 
 n = words 3.2k, 32k, 3.2M
 similar results > 700 MFLOPS.

 for(i=0; i<n; i=i+4)
 {
     x41 = vld1q_f32(ptrx1);

     z41 = vaddq_f32(x41, a41);
     z41 = vmulq_f32(z41, b41);

     z42 = vaddq_f32(x41, c41);
     z42 = vmulq_f32(z42, d41);
     z41 = vsubq_f32(z41, z42);

     z42 = vaddq_f32(x41, e41);
     z42 = vmulq_f32(z42, f41);
     z41 = vaddq_f32(z41, z42);

     z42 = vaddq_f32(x41, g41);
     z42 = vmulq_f32(z42, h41);
     z41 = vsubq_f32(z41, z42);

     z42 = vaddq_f32(x41, j41);
     z42 = vmulq_f32(z42, k41);
     z41 = vaddq_f32(z41, z42);

     z42 = vaddq_f32(x41, l41);
     z42 = vmulq_f32(z42, m41);
     z41 = vsubq_f32(z41, z42);

     z42 = vaddq_f32(x41, o41);
     z42 = vmulq_f32(z42, p41);
     z41 = vaddq_f32(z41, z42);

     z42 = vaddq_f32(x41, q41);
     z42 = vmulq_f32(z42, r41);
     z41 = vsubq_f32(z41, z42);

     z42 = vaddq_f32(x41, s41);
     z42 = vmulq_f32(z42, t41);
     z41 = vaddq_f32(z41, z42);

     z42 = vaddq_f32(x41, u41);
     z42 = vmulq_f32(z42, v41);
     z41 = vsubq_f32(z41, z42);

     z42 = vaddq_f32(x41, w41);
     z42 = vmulq_f32(z42, y41);
     z41 = vaddq_f32(z41, z42);

     vst1q_f32(ptrx1, z41);

     ptrx1 = ptrx1 + 4;
 }
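
The loop omits its setup. A minimal sketch of the declarations it assumes (my reconstruction from the loop body, not the original source):

    #include <arm_neon.h>

    float *ptrx1 = x;                    /* pointer into the data array            */
    float32x4_t x41, z41, z42;           /* working registers                      */
    float32x4_t a41 = vdupq_n_f32(a);    /* each scalar constant broadcast into a  */
    float32x4_t b41 = vdupq_n_f32(b);    /* 4-wide NEON register                   */
    /* ... and likewise c41, d41, ... y41 for the remaining constants ... */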

Next is the disassembly, again with an excessive number of load instructions.

Assembly Code

.L26:
    vld1.32 {d16-d17}, [ip]
    vld1.64 {d20-d21}, [sp:64]
    vadd.f32        q9, q8, q14
    vadd.f32        q11, q8, q10
    vldr    d24, [sp, #16]
    vldr    d25, [sp, #24]
    vmul.f32        q11, q11, q13
    vmul.f32        q9, q9, q12
    vldr    d24, [sp, #32]
    vldr    d25, [sp, #40]
    vsub.f32        q11, q11, q9
    vadd.f32        q10, q8, q12
    vldr    d18, [sp, #48]
    vldr    d19, [sp, #56]
    vldr    d24, [sp, #64]
    vldr    d25, [sp, #72]
    vmul.f32        q10, q10, q9
    vadd.f32        q9, q8, q12
    vadd.f32        q11, q11, q10
    vldr    d20, [sp, #80]
    vldr    d21, [sp, #88]
    vldr    d24, [sp, #96]
    vldr    d25, [sp, #104]
    vmul.f32        q9, q9, q10
    vadd.f32        q10, q8, q12
    vsub.f32        q11, q11, q9
    vldr    d18, [sp, #112]
    vldr    d19, [sp, #120]
    vldr    d24, [sp, #128]
    vldr    d25, [sp, #136]
    vmul.f32        q10, q10, q9
    vadd.f32        q9, q8, q12
    vadd.f32        q11, q11, q10
    vldr    d24, [sp, #160]
    vldr    d25, [sp, #168]
    vldr    d20, [sp, #144]
    vldr    d21, [sp, #152]
    add     r3, r3, #4
    cmp     r0, r3
    vmul.f32        q9, q9, q10
    vadd.f32        q10, q8, q12
    vsub.f32        q11, q11, q9
    vmul.f32        q10, q10, q15
    vadd.f32        q9, q8, q3
    vadd.f32        q11, q11, q10
    vmul.f32        q9, q9, q2
    vadd.f32        q10, q8, q1
    vsub.f32        q11, q11, q9
    vmul.f32        q10, q10, q0
    vadd.f32        q9, q8, q4
    vadd.f32        q10, q11, q10
    vmul.f32        q9, q9, q5
    vadd.f32        q8, q8, q6
    vsub.f32        q10, q10, q9
    vmul.f32        q8, q8, q7
    vadd.f32        q10, q10, q8
    vst1.32 {d20-d21}, [ip]!
    bgt     .L26
Roy Longbottom
  • To answer the OP's question, where and what is the peak number of floating-point operations executing simultaneously? The question is FLOPs per cycle, so if four FLOPs sometimes occur simultaneously and complete in one cycle, the answer is "4". – wallyk Jun 23 '15 at 22:29
  • @wallyk, in order to say 4 FLOPs per cycle it needs to be throughput bound and not latency bound. I mean, suppose a 64-bit wide (two floats) NEON MAC instruction had a latency of 4 but a throughput of 1; then four of these instructions would have to complete in four clock cycles to claim 4 FLOPs/cycle. A single instruction is not sufficient. – Z boson Jun 24 '15 at 07:52
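
To illustrate that last point, here is a minimal sketch (my own, with assumed latency/throughput numbers) of keeping several independent NEON accumulators in flight, so that a dot-product-style loop is limited by VMLA throughput rather than by the result latency of a single accumulator chain:

    #include <arm_neon.h>

    /* sum += a[i]*b[i] with four independent q-register accumulators.  If VMLA.F32
       delivers one q-register result every 2 cycles but has a longer result latency,
       a single accumulator chain stalls on that latency, while independent chains
       can keep the pipeline full.  n must be a multiple of 16. */
    float dot_probe(const float *a, const float *b, int n)
    {
        float32x4_t s0 = vdupq_n_f32(0.0f), s1 = s0, s2 = s0, s3 = s0;
        int i;
        for (i = 0; i < n; i += 16) {
            s0 = vmlaq_f32(s0, vld1q_f32(a + i),      vld1q_f32(b + i));
            s1 = vmlaq_f32(s1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
            s2 = vmlaq_f32(s2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
            s3 = vmlaq_f32(s3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
        }
        float32x4_t s = vaddq_f32(vaddq_f32(s0, s1), vaddq_f32(s2, s3));
        float out[4];
        vst1q_f32(out, s);
        return out[0] + out[1] + out[2] + out[3];
    }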