I'm looking for a way to optimize the code below to avoid any stalls. I've added a line of code before the bne instruction, but I'm still getting Branch Taken Stalls. I'm new to this, so it is sort of magic to me.

.data

CONTROL: .word32 0x10000
DATA:    .word32 0x10008
NUMBER: .word32 0

.text
    daddi r1,r1,4     ; r1 = r1 + 4
    dadd r2,r1,r1    ; r2 = r1 + r1
    dadd r3,r2,r2    ; r3 = r2 + r2
    dadd r4,r3,r3    ; r4 = r3 + r3
    daddi r10,r0,0x4000 ; r10 = 0x4000 = 16384
et: lw r1, 0(r0)     ; RAW hazard with memory
    sw r4, 0(r0)    ; copy from register to memory
    dadd r2,r1,r1    ; r2 = r1 + r1
    dadd r3,r2,r2    ; r3 = r2 + r2
    dadd r4,r3,r3    ; r4 = r3 + r3
    dadd r1,r1,r0    ; r1 = r1 + 0 (the line added before bne)
    bne r4,r10,et
    mtc1 r4,f4
    mtc1 r3,f3
    mul.d f5,f3,f4 ;multiple pipes
    add.d f6,f3,f4
    div.d f7,f3,f4
    mul.d f5,f3,f4 ;block RAW
    add.d f6,f3,f5
    mul.d f5,f3,f4 
    add.d f5,f3,f4
    add.d f5,f3,f4 
    mul.d f5,f3,f4 
    add.d f4,f3,f4 
    halt

mikeyMike
  • What kind of pipeline are you optimizing for? Some fake MIPS without branch-delay slots? Or some superscalar MIPS where the branch delay slot isn't enough to fully hide branch latency like it is on classic MIPS I R2000 machines? [Is that true if we can always fill the delay slot there is no need for branch prediction?](https://stackoverflow.com/q/34114739) – Peter Cordes Nov 24 '21 at 13:15
  • `lw r1, 0(r0)` only reads the zero-register; nothing in your program writes to that register. (And if the CPU special cases it, it doesn't need to stall anyway because `r0` always reads as zero). Is the *RAW hazard* comment on that line talking about it being the write (of r1) that's read by `dadd r2,r1,r1`, forming a RAW hazard? (Which isn't a problem for a [classic RISC pipeline](https://en.wikipedia.org/wiki/Classic_RISC_pipeline) with bypass forwarding; the `sw` fills the load delay slot.) Anyway, we need details on the microarchitecture you're working with. – Peter Cordes Nov 24 '21 at 13:26
  • The first one, I think. I'm working with WinMIPS64. – mikeyMike Nov 24 '21 at 13:40
  • With branch delay slots disabled, and no branch prediction or anything else, you'd expect every branch to incur 1 stall cycle. Check on your simulator settings. – Peter Cordes Nov 24 '21 at 13:46
  • OK, got it, now it runs without Branch Taken Stalls. Is there a way to achieve the same result, but only by adding a new register or by adding/changing a line of code? – mikeyMike Nov 24 '21 at 14:37
  • Same total performance, yeah sure, use a shift left by 3 instead of a chain of 3 adds. (Keep the original around so the code outside the loop can calculate the `r3 = value <<2` once after the loop, instead of every iteration.) And remove `dadd r1,r1,r0`; adding 0 to anything is a NOP. Loop unrolling may help avoid stalls if not-taken branches are faster. But it looks like a data-dependent branch, so you couldn't fully unroll the loop and remove all branch instructions. – Peter Cordes Nov 24 '21 at 15:12
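
The suggestion in the last comment might look roughly like the sketch below. It is untested and rests on two assumptions: that the simulator accepts `dsll` (doubleword shift left logical), and that nothing after the loop needs `r2`, which is no longer computed. Stall counts still have to be checked in the simulator's statistics, because `bne` reads `r4` immediately after `dsll` writes it, so a RAW stall may remain there if branches are resolved early in the pipeline.

```
et: lw   r1, 0(r0)      ; load the value from address 0 (r0 always reads as zero)
    sw   r4, 0(r0)      ; store r4 from the previous iteration; sits between the load and its first use
    dsll r4, r1, 3      ; r4 = r1 * 8, replaces the chain of three dependent dadds
    bne  r4, r10, et    ; repeat until r4 == 0x4000
    dsll r3, r1, 2      ; r3 = r1 * 4, only needed by the code after the loop
    mtc1 r4, f4
    mtc1 r3, f3
```

The `dadd r1,r1,r0` filler is gone because it does no work. `r1` is kept unshifted so that `r3` can be derived from it once after the loop. If branch delay slots are enabled, the `dsll r3, r1, 2` after `bne` would also run on every iteration, which does not change the result.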
