Unable to do tail recursion in C++ using clang with -O3 option

Question

I am unable to tell the clang compiler to perform tail-recursive optimisation in my C++ code. I saw this post Which, if any, C++ compilers do tail-recursion optimization?, where the advice is to use the -O3 flag with clang and it would do it. But in spite of that, I am having stack overflow for large inputs to the below primality testing code.

#include <iostream>
#include <chrono>
#include <gmp.h>
#include <string>

bool mpz_is_divisible(mpz_t n, mpz_t d);

void mpz_sqr(mpz_t res, mpz_t x) {
   //computes square of x and puts the result into res

   mpz_pow_ui(res, x, 2);
}

//the below function causes stack overflow for large inputs

void smallest_divisor(mpz_t n, mpz_t div, mpz_t div_sqr, mpz_t smallest_div) {
 //finds the smallest number which divides n and puts the result into smallest_div

  mpz_sqr(div_sqr, div);
  
  if (mpz_cmp(div_sqr, n) > 0) {
     mpz_set(smallest_div, n);
     return;
  }

  if (mpz_is_divisible(n, div)) {
    mpz_set(smallest_div, div);
    return;
  }

  mpz_add_ui(div, div, 1);
  smallest_divisor(n, div, div_sqr, smallest_div); //<-- should do tail recursion optimisation?
}

bool mpz_prime_test_basic(mpz_t n) {
  //checks if n is prime

  mpz_t div, div_sqr, smallest_div;

  mpz_inits(div, div_sqr, smallest_div, NULL);
  mpz_set_ui(div, 2);

  smallest_divisor(n, div, div_sqr, smallest_div);

  if (mpz_cmp(smallest_div, n) == 0) {
    mpz_clear(div);
    mpz_clear(div_sqr);
    mpz_clear(smallest_div);
    return true;
  }
  mpz_clear(div);
  mpz_clear(div_sqr);
  mpz_clear(smallest_div);
  return false;
}

bool mpz_is_divisible(mpz_t n, mpz_t d) {
  //checks if n is divisible by d
  mpz_t rem;
  mpz_init(rem);
  mpz_tdiv_r(rem, n, d);
  int cmp = mpz_cmp_si(rem, 0); //checks if remainder is equal to 0
  if (cmp == 0) return true;
  return false;
}

int main() {
  std::string num_str;
  mpz_t num;
  mpz_init(num);
  std::cout << "Enter number to check" << '\n';
  std::cin >> num_str;
  mpz_set_str(num, num_str.c_str(), 10);

  bool is_prime = mpz_prime_test_basic(num);

  if (is_prime) {
    std::cout << num_str << " is a prime\n";
  } else {
    std::cout << num_str << " is not a prime\n";
  }

  return 0;
}

I am going through chapter 1 of SICP right now and there's an exercise (1.22) about checking primality using an O(√n) algorithm. Tail-recursion is enforced by the Scheme standard, hence there is no problem there for large inputs but I am doing the same problem in C++, and for large numbers, the smallest-divisor function consumes the stack. I am using the GMP library for big integer arithmetic.

Is it possible to enforce tail recursion in such a program?

Thanks

What version of clang? Are you sure the stack overflow is related to the recursion itself? Because on Godbolt with clang 11, the tail recursion optimization seems done : https://godbolt.org/z/jqanTr — Christophe, Jan 05 '21 at 08:06
tail recursion can always be replaced with a loop which removes the need to hope that the compiler creates the loop for you, it also makes your code clearer: https://godbolt.org/z/5rdbbd — Alan Birtles, Jan 05 '21 at 08:23
@Christophe Apple clang version 12.0.0 (clang-1200.0.32.27). I think it might be due to it, all I get is a segmentation fault. — tf3, Jan 05 '21 at 08:24
@AlanBirtles i am doing SICP and I wanted to implement it in C++ without explicit loops (the way the book does it) — tf3, Jan 05 '21 at 08:27
@Christophe i will try to use a loop and see if it works, if it does, then it would almost confirm that (lack of) tail-recursion is the culprit, and if it doesn't, then tail-recursion is not to blame. — tf3, Jan 05 '21 at 08:30
Unrelated: If you use the gmp [C++ interface](https://gmplib.org/manual/C_002b_002b-Class-Interface), you can make your code cleaner by using its classes instead of the low level C types. RAII-based memory management, operator overloading, etc. — Shawn, Jan 05 '21 at 08:34
@Christophe I replaced the recursive call with the while loop and now it's runs just fine for large inputs. Looks like we can confirm the hypotheses that tail recursion optimisation was not happening. — tf3, Jan 05 '21 at 08:35
It would be strange it hapens in clang 11 and not in 12. Are you sure you build for release? — Christophe, Jan 05 '21 at 08:38
@Christophe I thought it was just the -O3 flag I had to set to make it release. And the above code is all there is to it. Is there a way to be explicit about compiling for release? Are there some other flags to set? I did #define DNDEBUG at the top and i still get Seg fault. — tf3, Jan 05 '21 at 08:41
I'm not sure about clang, but it's generally compatible with gcc, which uses `-foptimize-sibling-calls` to try to do TCO (Enabled at `-O2`). Of course, as you're finding, just asking the compiler to do tail call optimizations doesn't mean it actually will in every case. — Shawn, Jan 05 '21 at 08:53
@Christophe Clang "12" is a lie made up by Apple. Latest Clang is Clang 11. Apple Clang is an abomination that's usually a version or two behind (and might have some modifications made) but has a higher version number to mark it as being not "official" Clang. [Wikipedia lists](https://en.wikipedia.org/wiki/Xcode#Xcode_7.0_-_12.x_(since_Free_On-Device_Development)#:~:text=12.0.0) Clang 12 as actually being Clang 10. I would suggest to OP to install LLVM Clang. When I look at Godbolt, I see TCO happening back *many* Clang versions, so maybe Apple Clang just disables it or something. — HTNW, Jan 05 '21 at 08:55
@tf3 flags are not enough if you're in xcode: you need to go to menu Product > Scheme > Edit Scheme... and then chose "Release" as build configuration — Christophe, Jan 05 '21 at 09:00
@HTNW there are a lot of default settings. If you do not tell xcode it's for release you cannot be sure that another flag doesn't disable the optimizations. The same is btw true for VC, but it's more obvious to see in which mode you are and to switch between release and test. — Christophe, Jan 05 '21 at 09:03
@Christophe I am compiling in the terminal using the command "clang++ stackflw.cc -lgmp -O3 -std=c++14 " — tf3, Jan 05 '21 at 09:11
@Shawn yeah, if the C++ standard doesn't enforce it, it shouldn't come as surprising but it's in fact surprising since the LLVM clang 11 and gcc do it with the -O3 flag — tf3, Jan 05 '21 at 09:13

HTNW · Accepted Answer · 2021-01-05T17:31:11.120

Something is odd about mpz_is_divisible. For one, you forget to free the memory held by rem. Even if you add the mpz_clear(rem) call, you don't get TCO in smallest_divisor. What you have to do is mark it __attribute__((noinline)) to get Clang to do the optimization (though GCC does it without it). Very strange, and also moot, since mpz_is_divisible is just a worse version of mpz_divisible_p.

void smallest_divisor(mpz_t n, mpz_t div, mpz_t div_sqr, mpz_t smallest_div) {    
    mpz_sqr(div_sqr, div);
    if (mpz_cmp(div_sqr, n) > 0) {
        mpz_set(smallest_div, n);
        return;
    }
    if (mpz_divisible_p(n, div)) {
        mpz_set(smallest_div, div);
        return;
    }
    mpz_add_ui(div, div, 1);
    smallest_divisor(n, div, div_sqr, smallest_div); // gets TCO
}

Godbolt (steals GMP from trunk GCC, may have to update some paths with the current date to make it work)

I've also cleaned up the code a lot to produce a new version. In particular, my FP intuition tells me that smallest_divisor is really not supposed to be its own function. It's a recursive worker that belongs inside of mpz_prime_test_basic. We can use this answer to write local recursive definitions. I couldn't find gmpxx.h on Godbolt, so I also wrote a clone of mpz_class. Then we can write

bool is_prime(mpz_t n) noexcept {
    mpz div(2L);
    fix{[](auto rec, mpz_t n, mpz_t div, mpz_t div_sqr) noexcept -> void {
        mpz_mul(div_sqr, div, div);
        if(mpz_cmp(div_sqr, n) > 0) mpz_set(div, n);
        else if(!mpz_divisible_p(n, div)) {
            mpz_add_ui(div, div, 1);
            rec(n, div, div_sqr);
        }
    }}(n, div, mpz());
    return mpz_cmp(div, n) == 0;
}

Since we're not actually changing the arguments between the calls, just mutating behind them, we can also just write

bool is_prime(mpz_t n) noexcept {
    mpz div(2L), div_sqr;
    fix{[&](auto rec) noexcept -> void {
        mpz_mul(div_sqr, div, div);
        if(mpz_cmp(div_sqr, n) > 0) mpz_set(div, n);
        else if(!mpz_divisible_p(n, div)) {
            mpz_add_ui(div, div, 1);
            rec();
        }
    }}();
    return mpz_cmp(div, n) == 0;
}

Godbolt (same caveat)

Yeah, I didn't get TCO by mpz_clear(rem). Disappointing to say the least, maybe it's Clang's "fault". Thanks for the effort! — tf3, Jan 05 '21 at 17:18

score 1 · Answer 2 · answered Jan 05 '21 at 09:30

In principle, all major C++ compilers make tail call optimizations. And your code should benefit from it.

Having looked at the assembly obtained from your code on Godbolt with Clang 11, it appears that the -O3 flag indeed leads to tail code optimization. Your recursive function is compiled as:

smallest_divisor(test, test, test, test): # @smallest_divisor(test, test, test, test)
  push rax
.LBB1_1: # =>This Inner Loop Header: Depth=1  <------ TCO Loop 
  mov edi, 2
  call mpz_pow_ui(test, test, unsigned int)
  call mpz_cmp(test, test)
  test eax, eax
  jg .LBB1_4                                  <------ go to return sequence
  call mpz_init(test)
  call mpz_tdiv_r(test, test, test)
  xor edi, edi
  call mpz_cmp_si(test, int)
  test eax, eax
  je .LBB1_4
  mov edi, 1
  call mpz_add_ui(test, test, unsigned int)
  jmp .LBB1_1                                <--- no recursive call but loop
.LBB1_4:                                     <--- this is the end of recursion
  pop rax
  jmp mpz_set(test, test) # TAILCALL         <-- no ret, but another tail call :-)

I cannot test your code, since I do not have the library and do not have the time to invest in getting it and installing it.

From the symptoms you describe, I can see the following possible root causes:

there's some UB in the library, leading to a segfault
optimization is disable due to compiler settings. If you're using XCode for instance, you'll need to go to menu Product > Scheme > Edit Scheme... and then chose "Release" as build configuration.

Having tried the assembly in XCode (Product > Perform action > Assemble), the generated assembly code shows evidence of TCO:

__Z16smallest_divisor4testS_S_S_:       ## @_Z16smallest_divisor4testS_S_S_
Lfunc_begin1:
    .loc    67 29 0                 ## Test0105-1/main.cpp:29:0
    .cfi_startproc
## %bb.0:
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register %rbp
LBB1_1:                                 ## =>This Inner Loop Header: Depth=1
Ltmp3:
    .loc    67 24 4 prologue_end    ## Test0105-1/main.cpp:24:4
    movl    $2, %edi
    callq   __Z10mpz_pow_ui4testS_j
   ....
    movl    $1, %edi
    callq   __Z10mpz_add_ui4testS_j
Ltmp13:
    .loc    67 0 3 is_stmt 0        ## Test0105-1/main.cpp:0:3
    jmp LBB1_1                             <<<<<<<<<<<< Tail call optimization (loop)
LBB1_4:
    popq    %rbp
Ltmp14:
    jmp __Z7mpz_set4testS_      ## TAILCALL

Thanks for the answer, though at the moment it's beyond my ability to understand it. Maybe it's got to do with the library. — tf3, Jan 05 '21 at 09:35
Since you use the command line,you can generate yourself the assembly with clang++ ... -O3 -std=c++14 -S -masm=intel. You can then search for smallest_divisor, and find a "mangled" identifier (string with some weird prefix and suffix) at the beginning of a line with a ":". This is your function (between .cfi_startproc and .cfi_endproc). If it'd be recursive you'd find a line like "call ...smallest_divisor...." (that's a function call in assembler). But if you don't find this recursive call and instead find a JMP, JE, JNZ or alike to a label towards the beginning of your function, it's TCO. — Christophe, Jan 05 '21 at 10:01
I find this line between .cfi_startproc and .cfi_endproc: call __Z16smallest_divisorP12__mpz_structS0_S0_S0_, so maybe it is recursive — tf3, Jan 05 '21 at 11:29
@tf3 interesting indeed! This strongly suggest a recursive one. The only call you should have to this function should be between the cfi_startproc and endproc corresponding to `...mpz_prime_test_basic...:`, but that’s 30 lines below the endproc of the smallest divisor. I have the same version of the compiler. So it’s either a question of flages, or there are some issues with the libraries that messes up with optimizer. — Christophe, Jan 05 '21 at 12:20
I compiled your modified version of code from godbolt where you created a "test " class and used mpz_t as an "alias" for test, using the same flags as you said (-S -masm=intel), and the assembly code does perform tail recursive optimisation. So it really is down to the GMP library, which is the obstacle here to tail recursion optimisation. So finally, it's light at the end of the tunnel. I will just use loops :) — tf3, Jan 05 '21 at 12:35
@tf3 The problem seems to be `mpz_is_divisble`... somehow. If I replace it with just GMP's `mpz_divisible_p`, then `smallest_divisor` gets TCO'd. If it's your function, it doesn't. Odd. If I stick a `__attribute__((noinline))` on `mpz_is_divisible`, it also works (after I add the `mpz_clear(rem);` that you forgot!). — HTNW, Jan 05 '21 at 16:36
@HTNW yep, it seems I forgot something big, no wonder the program was also hogging memory! Maybe, not freeing the memory in the mpz_is_divisible was preventing it from taking constant memory resources, which in turn prevented it from becoming an iterative process! Hence the recursion. This is the answer. Write an answer and I will accept it. — tf3, Jan 05 '21 at 17:02

Unable to do tail recursion in C++ using clang with -O3 option

2 Answers2