Collatz function in assembly - segmentation fault

Question

I am trying to write a hybrid program between C and x86-64 assembly language. This program should calculate the largest stopping time of a number between 1 and given parameter n using the Collatz function. The main function is written in C and in its for-loop it calls an external function written in assembly.

However, I am getting a segmentation fault when running the compiled hybrid program for values larger than 2. Using gdb I've found the error to be when I make the recursive call. This is the error I am getting:

Program received signal SIGSEGV, Segmentation fault.
0x00000000004006c3 in is_odd ()

C code:

#include <stdio.h>
#include <stdlib.h>

int noOfOp = 0;

extern int collatz(long long n);

// The main function. Main expects one parameter n.
// Then, it computes collatz(1), colllatz(2), ..., collataz(n) and finds the
// a number m, 1 <= m <= n with the maximum stopping time.
int main(int argc, char *argv[]){
    if (argc != 2) {
        printf("Parameter \"n\" is missing. \n");
        return -1;
    } else {
        int max=0;
        long long maxn=0;
        int tmp=0;
        long long n = atoll(argv[1]);
        for (long long i=1 ; i<=n ; i++) {
            tmp = collatz(i);
            if (tmp > max) {
                max = tmp;
                maxn=i;
            }
        }
        printf("The largest stopping time between 1 and %lld was %lld ", n,maxn);
        printf("with the stopping time of %d. \n", max);
    }
}

And this is the x86-64 assembly code I've written. I expect this code to reflect my lack of proper understanding of assembly, yet. This is an assignment in class of which we have been given four days to complete on this new topic. Normally I would have read more documentation but I simple am in lack of the time. And assembly language is hard.

.section .text
.global collatz

collatz:
    pushq   %rbp            # save old base pointer
    movq    %rsp, %rbp      # create new base pointer
    subq    $16, %rsp       # local variable space


    cmpq    $1, %rdi        # compare n to 1
    je      is_one          # if n = 1, return noOfOp

    incq    noOfOp          # else n > 1, then increment noOfOp
    movq    %rdi, %rdx      # move n to register rdx
    cqto                    # sign extend rdx:rax
    movq    $2, %rbx        # move 2 to register rbx
    idivq   %rbx            # n / 2 -- quotient is in rax, remainder in rdx
    cmpq    $1, %rdx        # compare remainder to 1
    je      is_odd          # if n is odd, jump to is_odd
    jl      is_even         # else n is even, jump to is_even
    leave                   # remake stack
    ret                     # return


is_odd:
    movq    %rdi, %rdx      # move n to register rdx
    cqto                    # sign extend rdx:rax
    movq    $3, %rbx        # move 3 to register rbx
    imulq   %rbx            # n * 3 -- result is in rax:rdx
    movq    %rax, %rdi      # move n to register rdi
    incq    %rdi            # n = n + 1
    call    collatz         # recursive call: collatz(3n+1) <---- this is where the segmentation fault seems to happen
    leave                   # remake stack
    ret                     # return

is_even:
    movq    %rax, %rdi      # n = n / 2 (quotient from n/2 is still in rax)
    call    collatz         # recursive call: collatz(n/2) <---- I seem to have gotten the same error here by commenting out most of the stuff in is_odd
    leave                   # remake stack
    ret                     # return

is_one:
    movq    noOfOp, %rax    # set return value to the value of noOfOp variable
    leave                   # remake stack
    ret                     # return

I appreciate any and all the help and suggestions I can get.

"Normally I would have read more documentation but I simple am in lack of the time" - So you outsource debugging to us? That's not how stack overflow works! We are not a debugging service. See [ask]. — too honest for this site, Dec 21 '16 at 02:35
Use the debugger to single step the code and verify the exit condition. Not sure why you expect it to stop, `3n+1` will grow to infinity even if you sometimes divide by `2`. PS: `rbx` is a callee-saved register. PS #2: using `idiv` for dividing by 2 is frowned upon :) PS #3: similarly for `imul` and `3`. — Jester, Dec 21 '16 at 02:36
What is the call interface for your C compiler when a function is called? Do you need to push the argument value onto the stack before doing the call to `collatz()` or is it passed in register `rdi`? What I would do is write the `collatz()` function in C and then run it through the entire program through the compiler to generate assembler output and see what the compiler is doing with the recursive call. — Richard Chambers, Dec 21 '16 at 02:40
You work with `noOfOp` as if it is a qword, but it's `int` in C. I never believe `int`s when I'm accessing ASM... `uint64_t` is preferred by me ("stdint.h" include). — Ped7g, Dec 21 '16 at 03:25
Your code has a lot of the same performance problems as this [slow-but-working hand-written asm vs. C++ question](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat). See my answer there (and a couple of the others) for how to make it fast. — Peter Cordes, Dec 21 '16 at 22:07
So which instruction is `0x00000000004006c3 in is_odd ()`, and what are the register values at that point? You got half way to giving useful debugging information, but stopped just short. (See [ask] and [mcve]: I'm not going to spend my time looking at everything in your code when you can't be bothered to tell me exactly how it crashed.) — Peter Cordes, Dec 21 '16 at 22:08

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

Two problems that I see just from inspecting the code:

noOfOp is declared as an int, which will be a 32-bit type on x86-64. Your assembly code, however, is treating it as if it were a 64-bit type. Specifically, where you increment it by one using incq. That should instead be incl noOfOp or addl $1, noOfOp.

Along the same lines, your collatz function is prototyped as returning an int, but your code suggests that you are trying to return a 64-bit value in rax. This won't cause any problems, because the caller will just use only the lower 32 bits, but it may cause correctness problems.
You are ignoring the calling convention when recursively calling the collatz function. Assuming that you are on Linux, the applicable one would be the System V AMD64 calling convention. Here, the RBP and RBX registers are callee-save. Therefore, you need to preserve their contents. Do be sure to familiarize yourself with the calling convention and follow its rules.

As one of the commenters suggested, it may be easiest to write the function first in C or C++, before translating it to assembly. This will also make it easier to debug, and it also makes it possible to see what code the compiler emits. You can check the compiler's output against your own hand-written assembly code.

There may be additional problems with your code that I didn't spot. You can find them for yourself by single-stepping through your code with a debugger. You are already using GDB, so this should be simple to do.

score 1 · Answer 2 · answered Dec 23 '16 at 19:11

After what Peter suggested in the comments above, I read what him and other brilliant people discussed in another thread of the same topic. This is the code I ended up with after implementing some of those ideas. This is now 30% faster than that compiled with gcc -O3. I cannot believe how much more faster the program can be these different "tricks" - I truly learned a lot about efficiency. Thank you to those who helped.

.section .text
.global collatz

collatz:
    pushq   %rbp                    # save old base pointer
    movq    %rsp, %rbp              # create new base pointer
    subq    $16, %rsp               # local variable space  
    movq    $-1, %r10               # start counter at -1

while_loop:
    incq    %r10                    # increment counter
    leaq    (%rdi, %rdi, 2), %rdx   # rdx = 2 * n + n
    incq    %rdx                    # rdx = 3n+1
    sarq    %rdi                    # rdi = n/2
    cmovc   %rdx, %rdi              # if CF, rdi = rdx
                                    # (if CF was set during right shift (i.e. n is odd) set rdi to 3n+1)
                                    # else keep rdi to n/2
    jnz     while_loop              # if n =/= 1 do loop again:
                                    # Z flag is only set if sarq shifts when n is 1 making result 0.
                                    # else
    movq    %r10, %rax              # set return value to counter
    leave                           # remake stack
    ret                             # return

MikL · Answer 3 · 2016-12-21T16:42:19.837

0

Thank you for all the answers. I apologize if my question did not follow Stack Overflow guidelines.
I meant that normally I would not bother others with this if I had more time. Instead I sought guidance - not presumed debugging service - that could lead me on the right path.

For anyone interested I got the program to work. I went with a different approach than originally posted and made some changes for speed up. Below is the new assembly code.

.section .text
.global collatz

collatz:
    pushq   %rbp            # save old base pointer
    movq    %rsp, %rbp      # create new base pointer
    subq    $16, %rsp       # local variable space

    cmpq    $1, %rdi        # compare n to 1
    je      is_one          # if n = 1, jump to is_one
                            # else n > 1
    incl    noOfOp          # increment noOfOp
    movq    %rdi, %rax      # move n to rax
    andq    $1, %rax        # AND 1 with n
    jz      is_even         # if n is even jump to is_even
                            # else n is odd
    movq    $3, %rdx        # move 3 to rdx
    imul    %rdx, %rdi      # n = 3 * n
    incq    %rdi            # n = 3n + 1
    call    collatz         # recursive call: collatz(3n+1)
    leave                   # remake stack
    ret                     # return

is_even:
    sarq    %rdi            # arithmetic right shift by 1 - divide n by 2
    call    collatz         # recursive call: collatz(n/2)
    leave                   # remake stack
    ret                     # return

is_one:
    movl    noOfOp, %eax    # set return value to noOfOp
    movl    $0, noOfOp      # reset noOfOp
    leave                   # remake stack
    ret                     # return

This works and is approx. 30% faster than the code I have written only in C. But I know from the assignment that I can shave off even more time making it more effective. If anyone has any ideas how to do so feel free to comment.

Thank you again.

edited Dec 21 '16 at 16:42

answered Dec 21 '16 at 16:22

MikL

11
3

1

If you compile your C with optimization enabled, it's hard to imagine that it would be slower than this asm. At least you're not using DIV anymore, but you're making a recursive call for every step!! There's a lot of other inefficiency here, like branching on odd/even. But even if you're going to do that, you could lay out your branches better, and do the tests more efficiently. (e.g. `test $1, %edi` to test the low bit, instead of a move and AND.) – Peter Cordes Dec 21 '16 at 22:16
If you're going to use IMUL instead of LEA (which is what you should really be using), at least use the immediate-operand form: `imul $3, %rdx, %rdi`. (Yes, [imul-immediate](http://www.felixcloutier.com/x86/IMUL.html) is special, and has room for a separate read-only source and write-only destination instead of operating in-place like `add $4, %eax` and the other original 8086 immediate instructions that use one of the operand fields as extra opcode bits in the machine-code encoding.) – Peter Cordes Dec 21 '16 at 22:17
Anyway, like I commented on the question, **see [my answer on another Collatz asm question](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466) for exactly how to optimize the crap out of it for Intel Haswell, or AMD Bulldozer-family**. (The optimal asm differs for different microarchitectures). Many people contributed some great ideas for algorithmic improvements, like always doing two steps at once since `3*odd_n+1` is always even. – Peter Cordes Dec 21 '16 at 22:24
Oh, reading Cody's answer, I just realized you're incrementing your counter *in memory* every iteration, instead of keeping it in one of the 15 general purpose registers (not including the stack pointer). That gives you a latency bottleneck of ~6 cycles for `inc` with a memory destination, and is probably the overall bottleneck in your code (plus the time you spend returning out of that deep recursive call tree after doing the useful work.) – Peter Cordes Dec 21 '16 at 23:06
Thank you Peter! Lots of great stuff. By skipping the recursive call for the odd numbers I shaved off a lot of time. However the difference in time between `imul` and `lea` was not noticeable as was the case with keeping the counter in a register instead of memory (which I don't know why I didn't do in the first place and which I don't know why is not faster..). The `gcc -O2` (and `gcc -O3`, no difference for me) was twice as fast as my code which in turn was twice as fast as no optimization from compiling the original code. – MikL Dec 22 '16 at 22:55
But my assignment has just been due and is now uploaded, so I went with what I had written. However I will definitely look into the optimized compiler-code to study how it is doing it. I would guess it tries to completely eliminate recursion. Thank you again. – MikL Dec 22 '16 at 22:56
Yes, the most efficient way for a compiler to implement tail-recursion is with a loop. If LEA wasn't faster, are you sure you did it right? e.g. `lea 1(%rdi, %rdi, 2), %rdi` to do `rdi = 3*rdi + 1`. As I pointed out in the Collatz asm-optimization answer I linked, `lea (%rdi, %rdi, 2), %rdi` / `inc %rdi` is actually lower latency on Intel SnB-family CPUs than doing the +1 as part of the LEA. – Peter Cordes Dec 23 '16 at 01:32
Did you also get rid of the conditional branch, and use a `cmov` instead? That should help a lot. (Run `perf stat ./your_program` and look at the mispredicted-branch percentage). If your code is still branching on odd/even, that might be the major bottleneck and further improvements to the calculation will only make a small difference. (This is also covered in my answer on the linked question, as well as the major algorithmic improvement of always doing a `shr $1, %rdi` after a 3N+1 without having to check for odd. Or combining that shift into part of the 3N+1 calc...) – Peter Cordes Dec 23 '16 at 01:34
1

Thank you again Peter. I think that maybe the branching and the one recursive call was the reason why it made no difference between using `imul` or `lea`. However with my new implementation after having better time reading through your (and a lot of other people's) excellent answers and comments on the thread, I got rid of recursion and implemented some of those great ideas and tricks. The new program is now 30% faster than that generated with `gcc -O3`. I cannot believe how much more efficient this is. You guys are truly brilliant people. – MikL Dec 23 '16 at 19:12
cool, clad you had fun playing with asm tuning :) Intel Sandybridge-family has fewer "weird" bottlenecks than some earlier architectures, so often there's less trial-and-error with performance counters. Looking at perf counters to see how close you're coming to the theoretical max 4 uops per clock is fun, though :) – Peter Cordes Dec 23 '16 at 21:11

Collatz function in assembly - segmentation fault

3 Answers3