
I wrote two methods to sort a list of numbers. They have the same time complexity, O(n^2), but their actual running times differ by a factor of about 3 (the second method takes roughly three times as long as the first one).

My guess is that the difference comes from the memory hierarchy (the counts of register/cache/memory fetches are quite different). Is that correct?

To be specific: the first method compares one list element with a variable and assigns values between them, while the second method compares two list elements and assigns values between them. I guess this means the second method performs many more cache/memory fetches than the first one. Right?

When the list has 10000 elements, the loop counts and running times are as below:

# TakeSmallestRecursivelyToSort   Loop Count: 50004999
# TakeSmallestRecursivelyToSort   Time: 7861.999988555908       ms
# CompareOneThenMoveUntilSort     Loop Count: 49995000
# CompareOneThenMoveUntilSort     Time: 17115.999937057495      ms
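For reference, a minimal timing harness along these lines could produce such numbers (this sketch is my assumption; the original measurement code isn't shown):

import random
import time

def time_sort(sort_func, n=10000):
    """Time sort_func on a shuffled list of n integers, reporting wall-clock milliseconds."""
    data = list(range(n))
    random.shuffle(data)
    start = time.perf_counter()
    sort_func(data)
    print(f'# {sort_func.__name__:30} Time: {(time.perf_counter() - start) * 1000} ms')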

This is the code:

# first method
def TakeSmallestRecursivelyToSort(input_list: list) -> list:
    """In-place sorting, find smallest element and swap."""
    count = 0
    for i in range(len(input_list)):
        #s_index = Find_smallest(input_list[i:]) # s_index is relative to i
        if len(input_list[i:]) == 0:
            raise ValueError
        if len(input_list[i:]) == 1:
            break
        index = 0
        smallest = input_list[i:][0]
        for e_index, j in enumerate(input_list[i:]):
            count += 1
            if j < smallest:
                index = e_index
                smallest = j
        s_index = index
        input_list[i], input_list[s_index + i] = input_list[s_index + i], input_list[i]
    print('TakeSmallestRecursivelyToSort Count', count)
    return input_list


# second method
def CompareOneThenMoveUntilSort(input_list: list) -> list:
    count = 0
    for i in range(len(input_list)):
        for j in range(len(input_list) - i - 1):
            count += 1
            if input_list[j] > input_list[j+1]:
                input_list[j], input_list[j+1] = input_list[j+1], input_list[j]
    print('CompareOneThenMoveUntilSort Count', count)
    return input_list
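As a quick sanity check (my addition, not part of the original post), both functions can be exercised on a small list:

data = [5, 3, 1, 4, 2]
# both sorts work in place but also return the list, so compare against sorted()
assert TakeSmallestRecursivelyToSort(data[:]) == sorted(data)
assert CompareOneThenMoveUntilSort(data[:]) == sorted(data)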
Yoyo
  • `input_list[i:]` creates a copy of the list, taking O(N) time. – Martijn Pieters Sep 07 '17 at 11:52
  • `input_list[i:][0]` is a very expensive way of spelling `input_list[i]`. – Martijn Pieters Sep 07 '17 at 11:53
  • Possible duplicate of [What is a plain English explanation of "Big O" notation?](https://stackoverflow.com/questions/487258/what-is-a-plain-english-explanation-of-big-o-notation) –  Sep 07 '17 at 11:55
  • A constant factor (as well as a constant offset) is always included in Big O. All it talks about is asymptotic complexity growth for large inputs for a given algorithm. It does not make any claims about comparisons of actual runtime between algorithms/implementations. – Thilo Sep 07 '17 at 11:57
  • Big-O hasn't got anything to do with the actual runtime. Apart from the inefficient implementation, the first example contains 3 times as many branches, which will usually slow down the execution time. But the gist is: **never** use Big-O to conclude anything about the actual runtime. –  Sep 07 '17 at 11:58
  • @Thilo, my question's intention is to understand how the details of the implementation lead to different running times. I've read about how the cache affects performance, and I want to hear from others what other factors I should pay attention to. – Yoyo Sep 07 '17 at 12:01
  • @MartijnPieters, thanks for the comment; I didn't know that meant copying the list. – Yoyo Sep 07 '17 at 12:02
  • The slow one does a lot more swaps. – Matt Timmermans Sep 07 '17 at 12:23

1 Answer


Your first algorithm may make O(N^2) comparisons, but it only makes O(N) swaps. It's those swaps that take the most time. If you remove the swaps from the second algorithm, you'll see that it then takes significantly less time:

def CompareOneThenMoveUntilSortNoSwap(input_list: list) -> list:
    """Same comparisons as CompareOneThenMoveUntilSort, but with the swaps removed."""
    for i in range(len(input_list)):
        for j in range(len(input_list) - i - 1):
            if input_list[j] > input_list[j+1]:
                pass  # compare only; the swap is deliberately omitted
    return input_list

# 1000 randomised sequential integers, 100 repeats
TakeSmallestRecursivelyToSort:     4.625916245975532
CompareOneThenMoveUntilSort:       10.164166125934571
CompareOneThenMoveUntilSortNoSwap: 4.86395191506017

Just because two algorithms are in the same asymptotic order doesn't mean they'll be equally fast. Those constant costs still count when comparing implementations within the same order class. So while the two implementations show the same quadratic curve as you plot time taken against the number of elements sorted, the CompareOneThenMoveUntilSort implementation plots its line higher up the time-taken chart. To make the swap counts concrete, see the instrumented sketch below.
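Here is a small instrumented sketch (an editorial addition, not from the original answer; the count_swaps_* names are hypothetical) that tallies swaps rather than comparisons:

import random

def count_swaps_selection(input_list: list) -> int:
    """Selection-sort approach: at most one swap per outer pass, so O(N) swaps."""
    swaps = 0
    for i in range(len(input_list) - 1):
        index = min(range(i, len(input_list)), key=input_list.__getitem__)
        if index != i:
            input_list[i], input_list[index] = input_list[index], input_list[i]
            swaps += 1
    return swaps

def count_swaps_bubble(input_list: list) -> int:
    """Bubble-sort approach: up to one swap per comparison, so O(N^2) swaps."""
    swaps = 0
    for i in range(len(input_list)):
        for j in range(len(input_list) - i - 1):
            if input_list[j] > input_list[j+1]:
                input_list[j], input_list[j+1] = input_list[j+1], input_list[j]
                swaps += 1
    return swaps

data = list(range(1000))
random.shuffle(data)
print(count_swaps_selection(data[:]))  # at most N-1 swaps
print(count_swaps_bubble(data[:]))     # one swap per inversion, ~N^2/4 on random input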

Note that you have increased the constant cost of each outer-loop iteration in the TakeSmallestRecursivelyToSort implementation by adding 4 additional O(N) operations. Each input_list[i:] slice creates a new list object, copying across all references from index i onwards to the new list. It could be faster still:

def TakeSmallestRecursivelyToSortImproved(input_list: list) -> list:
    """In-place sorting, find smallest element and swap."""
    l = len(input_list)
    for i in range(l - 1):
        index = i
        smallest = input_list[i]
        for j, value in enumerate(input_list[i + 1:], i + 1):
            if value < smallest:
                smallest, index = value, j
        input_list[i], input_list[index] = input_list[index], input_list[i]
    return input_list

This one takes about 3 seconds.
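To verify the cost of those slices (an editorial check, not part of the original answer), timeit can compare the sliced access against direct indexing:

import timeit

setup = 'input_list = list(range(10000)); i = 5000'
# input_list[i:][0] builds a ~5000-element list just to read its first item
print(timeit.timeit('input_list[i:][0]', setup=setup, number=10000))
# input_list[i] is a single O(1) index lookup
print(timeit.timeit('input_list[i]', setup=setup, number=10000))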

Martijn Pieters
  • The first method also uses input_list[i:][0], yet the second method takes 3 times as much time as the first one... so your comment doesn't really explain the difference. – Yoyo Sep 07 '17 at 12:11
  • @Yoyo: your question was not clear on that. I see what you did wrong and I'll update. – Martijn Pieters Sep 07 '17 at 12:19
  • Thanks for your explanation. I thought more list[index] operations meant more cache misses in the second method, leading to more time spent on memory access, but your deduction makes more sense: I used too many swaps in the second method. Even if all of them are in cache, the second method will still take much more time. – Yoyo Sep 07 '17 at 12:33
  • @Yoyo: You really don't need to worry about the CPU cache when coding in Python; the CPU is too far removed from the Python code here; there is a whole bytecode interpreter loop in between. – Martijn Pieters Sep 07 '17 at 12:35
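
To see the bytecode interpreter loop mentioned in that last comment (an editorial illustration, not part of the thread), the dis module shows how many interpreted operations a single Python-level swap expands to:

import dis

def swap(a, i, j):
    a[i], a[j] = a[j], a[i]

# each line of Python source becomes several bytecode instructions,
# all dispatched through the interpreter loop
dis.dis(swap)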