
I was writing a simple utility that calculates the length of a linked list, where the linked list doesn't maintain an "internal" counter of its size/length. With this in mind, I had 3 simple approaches:

  1. Iterate over the linkedList until you hit the "end"
  2. Recursively calculate the length
  3. Recursively calculate the length without needing to return control to the calling function (using some flavor of tail-call recursion)

Below is some code that captures these 3 cases:

// 1. iterative approach
public static <T> int getLengthIteratively(LinkedList<T> ll) {

    int length = 0;
    for (Node<T> ptr = ll.getHead(); ptr != null; ptr = ptr.getNext()) {
        length++;
    }

    return length;
}

// 2. recursive approach
public static <T> int getLengthRecursively(LinkedList<T> ll) {
    return getLengthRecursively(ll.getHead());
}

private static <T> int getLengthRecursively(Node<T> ptr) {

    if (ptr == null) {
        return 0;
    } else {
        return 1 + getLengthRecursively(ptr.getNext());
    }
}

// 3. Pseudo tail-recursive approach
public static <T> int getLengthWithFakeTailRecursion(LinkedList<T> ll) {
    return getLengthWithFakeTailRecursion(ll.getHead());
}

private static <T> int getLengthWithFakeTailRecursion(Node<T> ptr) {
    return getLengthWithFakeTailRecursion(ptr, 0);
}

private static <T> int getLengthWithFakeTailRecursion(Node<T> ptr, int result) {
    if (ptr == null) {
        return result;
    } else {
        return getLengthWithFakeTailRecursion(ptr.getNext(), result + 1);
    }
}

Now I'm aware that the JVM doesn't support tail-call elimination out of the box, but when I ran some simple tests on linked lists of strings with ~10k nodes, I noticed that getLengthWithFakeTailRecursion consistently outperformed getLengthRecursively (by ~40%). Can the delta be attributed solely to the fact that in case #2 control has to be passed back once per node, so we're forced to unwind through all the stack frames?
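For context, here is roughly what a tail-call-eliminating compiler would turn approach #3 into: the tail call becomes a reassignment of the parameters plus a jump back to the top. This is a sketch of that transformation (with a minimal stand-in Node class, since I haven't shown my Node implementation above); the JVM won't perform this rewrite itself.

```java
// Sketch: the loop a TCO-capable compiler would derive from
// getLengthWithFakeTailRecursion(ptr, result). Node here is a
// minimal hypothetical stand-in for the real list node type.
public class TailCallAsLoop {

    static class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
        Node<T> getNext() { return next; }
    }

    // The tail call getLengthWithFakeTailRecursion(ptr.getNext(), result + 1)
    // becomes "result = result + 1; ptr = ptr.getNext();" and a jump to the top.
    static <T> int getLengthTailCallEliminated(Node<T> ptr, int result) {
        while (ptr != null) {          // the "if (ptr == null) return result" base case, inverted
            result = result + 1;
            ptr = ptr.getNext();
        }
        return result;
    }

    public static void main(String[] args) {
        Node<String> head = new Node<>("a");
        head.next = new Node<>("b");
        head.next.next = new Node<>("c");
        System.out.println(getLengthTailCallEliminated(head, 0)); // prints 3
    }
}
```

Since no new stack frame is created per node, this version cannot overflow the stack regardless of list length, unlike approaches #2 and #3 as written.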

Edit: Here's the simple test I used to check the performance numbers:

public class LengthCheckerTest {

    @Test
    public void testLengthChecking() {

        LinkedList<String> ll = new LinkedList<String>();
        int sizeOfList = 12000;
        // int sizeOfList = 100000; // Danger: This causes a StackOverflowError in the recursive methods!
        for (int i = 1; i <= sizeOfList; i++) {
            ll.addNode(String.valueOf(i));
        }

        // Note: nanoTime / 1000 is microseconds, not milliseconds
        long currTime = System.nanoTime();
        Assert.assertEquals(sizeOfList, LengthChecker.getLengthIteratively(ll));
        long totalTime = System.nanoTime() - currTime;
        System.out.println("totalTime taken with iterative approach: " + (totalTime / 1000) + "us");

        currTime = System.nanoTime();
        Assert.assertEquals(sizeOfList, LengthChecker.getLengthRecursively(ll));
        totalTime = System.nanoTime() - currTime;
        System.out.println("totalTime taken with recursive approach: " + (totalTime / 1000) + "us");

        // Interestingly, the fakeTailRecursion always runs faster than the vanillaRecursion
        // TODO: Look into whether stack-frame collapsing has anything to do with this
        currTime = System.nanoTime();
        Assert.assertEquals(sizeOfList, LengthChecker.getLengthWithFakeTailRecursion(ll));
        totalTime = System.nanoTime() - currTime;
        System.out.println("totalTime taken with fake TCR approach: " + (totalTime / 1000) + "us");
    }
}
BSJ

1 Answer

Your benchmarking methodology is flawed. You perform all three tests in the same JVM, so they are not on an equal footing. By the time the fake-tail test runs, the LinkedList and Node classes are already JIT-compiled, so it works faster. If you change the order of your tests, you will see different numbers. Each test should be performed in a separate JVM.
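You can see the effect without any framework by timing the very same method several times in one JVM: the first timed call pays interpretation and JIT-compilation costs that later calls don't, and whatever runs last in a shared-JVM test inherits that warm-up for free. A standalone sketch (with its own tiny node class, independent of your code):

```java
// Sketch: in a fresh JVM, the first timed run of a method is typically
// much slower than subsequent runs of the *same* code, because the
// first run includes interpretation and JIT-compilation overhead.
public class WarmupDemo {

    static final class Node {
        Node next;
    }

    static int length(Node head) {
        int n = 0;
        for (Node p = head; p != null; p = p.next) n++;
        return n;
    }

    public static void main(String[] args) {
        // Build a 10,000-node list
        Node head = new Node();
        Node cur = head;
        for (int i = 1; i < 10_000; i++) {
            cur.next = new Node();
            cur = cur.next;
        }

        for (int run = 1; run <= 5; run++) {
            long t = System.nanoTime();
            int len = length(head);
            long elapsed = System.nanoTime() - t;
            System.out.println("run " + run + ": " + elapsed + " ns (length=" + len + ")");
        }
        // Typically run 1 is noticeably slower than runs 2..5.
    }
}
```

This is exactly the bias in your JUnit test: the three methods are not measured under the same compilation state. JMH handles warm-up and forking for you, as shown below.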

Let's write a simple JMH microbenchmark for your case:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;

// 5 warm-up iterations, 500 ms each, then 10 measurement iterations 500 ms each
// repeat everything three times (with JVM restart)
// output average time in microseconds
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class ListTest {
    // You did not supply Node and LinkedList implementation
    // but I assume they look like this
    static class Node<T> {
        final T value;
        Node<T> next;

        public Node(T val) {value = val;}
        public void add(Node<T> n) {next = n;}

        public Node<T> getNext() {return next;}
    }

    static class LinkedList<T> {
        Node<T> head;

        public void setHead(Node<T> h) {head = h;}
        public Node<T> getHead() {return head;}
    }

    // Code from your question follows

    // 1. iterative approach
    public static <T> int getLengthIteratively(LinkedList<T> ll) {

        int length = 0;
        for (Node<T> ptr = ll.getHead(); ptr != null; ptr = ptr.getNext()) {
            length++;
        }

        return length;
    }

    // 2. recursive approach
    public static <T> int getLengthRecursively(LinkedList<T> ll) {
        return getLengthRecursively(ll.getHead());
    }

    private static <T> int getLengthRecursively(Node<T> ptr) {

        if (ptr == null) {
            return 0;
        } else {
            return 1 + getLengthRecursively(ptr.getNext());
        }
    }

    // 3. Pseudo tail-recursive approach
    public static <T> int getLengthWithFakeTailRecursion(LinkedList<T> ll) {
        return getLengthWithFakeTailRecursion(ll.getHead());
    }

    private static <T> int getLengthWithFakeTailRecursion(Node<T> ptr) {
        return getLengthWithFakeTailRecursion(ptr, 0);
    }

    private static <T> int getLengthWithFakeTailRecursion(Node<T> ptr, int result) {
        if (ptr == null) {
            return result;
        } else {
            return getLengthWithFakeTailRecursion(ptr.getNext(), result + 1);
        }
    }

    // Benchmarking code

    // Measure for different list lengths
    @Param({"10", "100", "1000", "10000"})
    int n;
    LinkedList<Integer> list;

    @Setup    
    public void setup() {
        list = new LinkedList<>();
        Node<Integer> cur = new Node<>(0);
        list.setHead(cur);
        for(int i=1; i<n; i++) {
            Node<Integer> next = new Node<>(i);
            cur.add(next);
            cur = next;
        }
    }

    // Do not forget to return result to the caller, so it's not optimized out
    @Benchmark    
    public int testIteratively() {
        return getLengthIteratively(list);
    }

    @Benchmark    
    public int testRecursively() {
        return getLengthRecursively(list);
    }

    @Benchmark    
    public int testRecursivelyFakeTail() {
        return getLengthWithFakeTailRecursion(list);
    }
}

Here are the results on my machine (x64 Win7, Java 8u71):

Benchmark                           (n)  Mode  Cnt   Score    Error  Units
ListTest.testIteratively             10  avgt   30   0,009 ±  0,001  us/op
ListTest.testIteratively            100  avgt   30   0,156 ±  0,001  us/op
ListTest.testIteratively           1000  avgt   30   2,248 ±  0,036  us/op
ListTest.testIteratively          10000  avgt   30  26,416 ±  0,590  us/op
ListTest.testRecursively             10  avgt   30   0,014 ±  0,001  us/op
ListTest.testRecursively            100  avgt   30   0,191 ±  0,003  us/op
ListTest.testRecursively           1000  avgt   30   3,599 ±  0,031  us/op
ListTest.testRecursively          10000  avgt   30  40,071 ±  0,328  us/op
ListTest.testRecursivelyFakeTail     10  avgt   30   0,015 ±  0,001  us/op
ListTest.testRecursivelyFakeTail    100  avgt   30   0,190 ±  0,002  us/op
ListTest.testRecursivelyFakeTail   1000  avgt   30   3,609 ±  0,044  us/op
ListTest.testRecursivelyFakeTail  10000  avgt   30  41,534 ±  1,186  us/op

As you can see, the fake-tail version is just as fast as the plain recursive version (within the error margin), and both are 20-60% slower than the iterative approach. So your result is not reproduced.

If you actually want not the steady-state measurement, but single-shot results (without warm-up), you can launch the same benchmark with the following options: -ss -wi 0 -i 1 -f 10. The results will look like this:
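For example, assuming you built the standard JMH uberjar (the benchmarks.jar produced by the JMH Maven archetype), the invocation would look something like:

```shell
# -ss: single-shot mode (one invocation per iteration)
# -wi 0: zero warm-up iterations
# -i 1: one measurement iteration
# -f 10: ten forked JVMs, so we get ten independent "cold" measurements
java -jar target/benchmarks.jar ListTest -ss -wi 0 -i 1 -f 10
```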

Benchmark                           (n)  Mode  Cnt    Score     Error  Units
ListTest.testIteratively             10    ss   10   16,095 ±   0,831  us/op
ListTest.testIteratively            100    ss   10   19,780 ±   6,440  us/op
ListTest.testIteratively           1000    ss   10   74,316 ±  26,434  us/op
ListTest.testIteratively          10000    ss   10  366,496 ±  42,299  us/op
ListTest.testRecursively             10    ss   10   19,594 ±   7,084  us/op
ListTest.testRecursively            100    ss   10   21,973 ±   0,701  us/op
ListTest.testRecursively           1000    ss   10  165,007 ±  54,915  us/op
ListTest.testRecursively          10000    ss   10  563,739 ±  74,908  us/op
ListTest.testRecursivelyFakeTail     10    ss   10   19,454 ±   4,523  us/op
ListTest.testRecursivelyFakeTail    100    ss   10   25,518 ±  11,802  us/op
ListTest.testRecursivelyFakeTail   1000    ss   10  158,336 ±  43,646  us/op
ListTest.testRecursivelyFakeTail  10000    ss   10  755,384 ± 232,940  us/op

So as you can see, the first launch is many times slower than subsequent launches, and your result is still not reproduced. I observe that testRecursivelyFakeTail is actually slower for n = 10000 (but after warm-up it reaches the same peak speed as testRecursively).

Tagir Valeev
  • You are indeed correct about how I measured the time taken (I've edited my post to reflect this). Would you recommend any literature where I could educate myself to get better at recognizing these naive pitfalls in context of benchmarking? – BSJ Feb 16 '16 at 04:26
  • @BSJ, I edited the answer to reflect your changes. As for your question, read JMH microbenchmark page along with [samples](http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/). Read [this question](http://stackoverflow.com/q/504103/4856258). Read some blog posts of performance engineers like Shipilev's [Nanotrusting the nanotime](http://shipilev.net/blog/2014/nanotrusting-nanotime/). – Tagir Valeev Feb 16 '16 at 04:30
  • Thanks much, this has been very educational! – BSJ Feb 16 '16 at 04:32