Is there anyway to increase the range of the loop command when a loop command must be used

Question

outerLoop:
    skipCarry:
        mov     eax, 2
        mov     inCompare, eax   ;Set our inCompare to 2
        mov     eax, outCompare 
        push    ecx ;Save status of these registers before entering the innerLoop
        push    eax
        mov     ecx, innerLoopCount

    isComposite:
        mov     eax, outCompare ;Value to be checked for composite
        mov     edx, 0  ;Make sure edx does not have a left over value
        div     inCompare   ;Divide outCompare by inCompare to see if we get a remainder of 0 in edx
        cmp     edx, 0
        jne     skipPrint   ;If the remainder is not 0, we have not proven that the number in question is composite, so do not print
        ;Print out a composite value if its been proven
        mov     eax, outCompare
        call    WriteDec
        mov     edx, OFFSET spaces
        call    WriteString     ;Write spaces between the composites
        mov     ebx, writeCounter
        inc     ebx     ;This will help to make sure we start a new line after writing 10 composites
        mov     writeCounter, ebx
        cmp     ebx, 10     ;If counter is at 10, we start a new line
        jne     exitInnerLoop   ;exit the inner loop because we found that the number in question is composite
        call    CrLf
        mov     writeCounter, 0
        jmp     exitInnerLoop       ;exit the inner loop because we found that the number in question is composite

        skipPrint:      ;Jumped to if the number in question was not confirmed to be composite

        ;This is to continue testing for composite of the number in question by incrementing the divisor
        mov     ebx, inCompare
        inc     ebx
        mov     eax, outCompare
        cmp     eax, ebx
        je      exitInnerLoop
        mov     inCompare, ebx

        skipIncrement:  ;Will be jumped to if there are no new divisors to be tested

        loop    isComposite     ;Retest with new divisior if there is a possibility
        exitInnerLoop:

    pop     eax     ;restore outer loop status
    pop     ecx
    inc     eax     ;Next number to check for being composite
    mov     outCompare, eax
    loop    outerLoop

I get the error code saying that the jump is too long by 5 bytes referring to the outerLoop. I know that it would be simpler to just use the jmp command and keep track of the counter myself, but my homework assignment requires that the loop is used instead. Is there anyway to extend the range of the loop or is there a way to shrink the content in the loops by 5 bytes that I am not seeing?

The loops are supposed to be checking if a number is composite, just so that the functionality is known.

about shortening the code... like any time you wish, there's ton of cruft, you should rather keep values in registers than in memory, put print code as whole into subroutine, etc... I did similar answer on codereview site, you may check it for some tricks and ideas (and forcing you to use `loop` is [evil](https://stackoverflow.com/q/35742570/4271923)): https://codereview.stackexchange.com/questions/148315/fasm-assembly-write-a-program-which-checks-if-a-number-is-a-prime-number/150779#150779 — Ped7g, Nov 03 '17 at 00:44
@Ped7g: I spend way too much time optimizing this loop, but it was fun. My outer `loop` only has to jump backwards 55 bytes, and the function doesn't touch memory other than to print, or to reload `innerLoopCount` (although it would be better to set it from sqrt(N). Or maybe update it cheaply by only recalculating every time you print a newline.) — Peter Cordes, Nov 03 '17 at 05:04

Peter Cordes · Accepted Answer · 2017-11-03T05:08:59.490

No, loop/loopne only have one encoding each, using a SHORT (signed 8-bit rel8) relative displacement.

x86 since 386 has jcc rel32 aka NEAR conditional branches as well as jcc rel8. If you need to jump farther than rel8, loop is the wrong choice for small code size. dec ecx / jnz is mostly equivalent, except it clobbers some flags.

Code-size is one of the only reasons you should ever use loop; it's slow on everything except recent AMD). (speed ~= code-size on 8086, so it's useful there. Later CPUs made it slow on purpose for compatibility with (bad?) delay loop in drivers.

If you really want to jump farther with loop, you could do this:

    loop  .stay_in_loop

    jmp   .leave_loop
.stay_in_loop:
    jmp .far_away
.leave_loop:

The .stay_in_loop block can be after the end of the function, or somewhere else that's never reached but within 127B, which lets you omit the jmp .leave_loop.

You can also "cheat" and put some of the loop body out-of-line. e.g. you could move the Print block to after the end of the function (or before the beginning), and jcc to it and jmp back to inside the loop.

Using two nested loop instructions is just braindead, but it seems you're required to use that. push/pop to save/restore the outer loop's ecx isn't essential; You could use another register and mov ecx, esi or something. Either way kinda sucks because the push or mov esi, ecx has to be at the top of the loop. Unless you fake it by doing dec esi yourself instead of copying back the value produced by loop. This makes the code cleaner, but may be considered cheating / defeating the purpose of the requirement. :P

Perhaps the real goal of your assignment was to get you to optimize for code-size so a short jump can reach. That's definitely the right approach for code-quality overall in your case.

It's normal for beginner code to be painfully inefficient, so there are many easy local optimizations here. e.g. xor edx,edx is 2 bytes (and faster) than 5 byte mov edx, 0. Use test edx,edx to compare against zero. It's smaller and better, and sets flags identically to cmp edx,0

Also, keep your variables in registers. An instruction like mov writeCounter, 0 assembles to mov r/m32, imm32, using a disp32 addressing mode. So that's 8 bytes of address and data, plus 2 bytes of opcode + ModR/M. Every absolute address instead of a register operand costs you 4 bytes.

If you need to save even more code-size inside the loop, you could copy things to the stack and access them with [ebp+disp8] addressing modes (1 byte beyond ModR/M instead of 4).

To free up more registers for use in the loop, push/pop them at the start/end of your function.

Cleaned-up version of the code in the question

The displacement in the outer loop is 0xc9, or -55. So my version is under half the size of your version. (With 12 bytes of code in ResetWriteCounter outside the loop.)

    push   ebx
    push   ebp       ; we're NOT using it as a frame pointer
    push   edi
    push   esi


;;; Important to keep in registers
; edi = inCompare = trial divisor
; ebx = outCompare = prime/composite candidate

;; nice to keep in registers
; ebp = WriteCounter  (but done as a down-counter with dec instead of inc/cmp)
; esi = outer loop counter

;; these variable names are kinda bad, IMO.
;; candidate / divisor would be better names.

; ecx = inner loop counter
; eax,edx = scratch for DIV


    mov    ebx, outCompare    ; first composite candidate

    mov    ebp, 10            ; down-counter to trigger newlines
    push   '  '          ; two spaces followed by 2 zero bytes (i.e. null-terminated string)
   ; Use mov edx, esp instead of mov edx, OFFSET spaces in the print block

   ;;; note that code before this isn't part of the size that LOOP has to jump over.

outerLoop:
        test    bl, 1
        jz      Print    ;; multiple of 2
        ;; inner loop now only has to check the odd divisors

        mov     edi, 3    ; first trial divisor = first odd prime (other than 1)
        mov     ecx, innerLoopCount     ; ideally set to sqrt(ebx), e.g. cvtsi2ss xmm0, ebx / sqrtss xmm0,xmm0 / cvtss2si xmm0, ecx might be correct :P
                   ; could instead use ecx = ebx directly if you don't care about speed, so try divisors all the way up to the candidate.

    ;; inner loop
    isComposite:
        mov     eax, ebx         ; edx:eax = dividend = candidate
        xor     edx, edx
        div     edi              ; trial division

        add     edi, 2           ; next trial divisor = next odd number
        test    edx, edx         ; equivalent to cmp edx,0 to check remainder
        jz      Print            ; you could maybe fold this into a LOOPNZ and check ZF after the loop
   ;If the remainder is not 0, we have not proven that the number in question is composite, so do not print

        ;;cmp     edi, ebx        ;  divisor, candidate
        ;;jae     exitInnerLoop   ; what's the point of this?  isn't ecx set so that loop falls through after this many iterations?

        loop   isComposite     ;Retest with new divisor if there is a possibility

        jmp     NoPrint        ; else it was prime

;;; I put this outside the inner loop in the first place instead of needing to multiple jumps out of this to exitInnerLoop
    Print:   ;Print out a composite value if its been proven
        mov     eax, ebx
        call    WriteDec

        dec     ebp              ; new line when the down-counter reaches zero
        jz      ResetWriteCounter

        ; print spaces only if not printing a newline
        mov     edx, esp        ; OFFSET spaces;  Use stack memory we pushed earlier
        call    WriteString     ;Write spaces between the composites

    ;; tail of outer loop.  Prime numbers jump here without printing.
    exitInnerLoop:
    NoPrint:
        inc     ebx         ;Next number to check for being composite
                            ; TODO: increment by 2 and don't even check the even numbers, just print them after every odd prime/composite

        mov     ecx, esi    ; ecx = outer loop counter for the LOOP insn
        dec     esi         ; update outer counter instead of having  mov esi, ecx at the top of the loop
        loop    outerLoop

    add    esp, 4    ; clear the '  ' we pushed earlier.  pop esi  is smaller and about as efficiently

    pop    esi
    pop    edi
    pop    ebp
    pop    ebx
    ret

  ; after the end of the function is a good place for infrequently-use blocks that you jump to and then jump back.
 ResetWriteCounter:
        mov     ebp, 10         ; reset counter
        ; start a new line after writing 10 composites
        call    CrLf
        ; could have done this branchlessly by selecting a string of '  ',0 or 13,10,0 based on ebp being zero
        ; but doing the ebp update branchlessly too is worse than this highly predictable branch.  Maybe if the count was a power of 2, DEC/AND

        jmp  NoPrint          ; back where we came from, skipping the write spaces
       ; alternative: duplicate inc / mov / dec / loop block here

There's more you could do in terms of branch structure. e.g. I'd like to eliminate the jmp NoPrint after falling out of the loop. That seems clunky. Putting the Print block somewhere else would do the trick, but then Print would have to jump instead of falling through.

TODO: use ecx as the trial divisor. Probably much slower to count down than up, though, because small prime factors are more common. Also need to avoid 1.

I was going to just apply some of the local optimizations, but by the time I was done understanding your branch structure, I was having fun golfing it for code size. golf as in code-golf = fewest bytes wins.

See some of my x86 machine code golf answers for code-size tricks (including at the expense of speed, sometimes by a lot):

GCD in an 8 byte loop
Chroma-key blend two images (arrays) in 13 bytes.
Extreme Fibonacci: first 1000 digits of Fib(10^9) in 105 bytes of x86 32-bit code (with Linux system calls), doing extended precision adc with power-of-10 decimal limbs for easy truncation to keep just the most significant digits. This one is neat because there's also a performance requirement, so the inner loop is optimized for speed. (Whole thing runs in about 70 sec on my i7-6700k.)
Log-scales are for quitters: strlen and memset with rep strings to replace strings of N chars with base^N filler chars; in 21 bytes.
Do n and n^3 have the same set of decimal digits, 40 bytes of machine code. (Using bts to build a bitmap in a register).)

Is there anyway to increase the range of the loop command when a loop command must be used

1 Answers1

Cleaned-up version of the code in the question