How to optimize the sorting of a big table with Python

Question

I have a table A with all values hidden.
The goal is to find a trick to sort these values with the help of two functions :

compare (x, y) checks in the table A whether x value < y value or not
If not : sort(x, y) sorts x compared to y with : A[x], A[y] = A[y], A[x]

I found a way to loop on that and automatically sort the table A.
Here is the snippet :

    for i in range (len(A)):
            for j in range (len(A)):
                    if compare (i, j):
                            sort(i, j)
                    pass

That works pretty well but I'd like to find another way to accelerate the processing.
Indeed the length of my table A is 32 so it is ok. But if the length is 1000, 5000 or even 10000 it would clearly take a lot of time and that could be a problem (especially if a time limit exists).
Consequently does another method exist to optimize my loop to bigger tables ?

Any help would be appreciated, thanks.

`sort` is unclear. You say "If not". Does that mean if you call it but "x value < y value" is not true, then it *doesn't* do the swap? Better provide both functions. And input/output. — Kelly Bundy, Apr 28 '23 at 09:32

score 1 · Answer 1 · answered Apr 28 '23 at 11:21

You can use the built-in sorting functions with custom comparator. By that, you get the desired permutation of a. Then, apply the swaps to actually put items of a into right places. Below is a step-by-step explanation.

First, let us pretend we have a list a with functions compare and swap, and we also know its length n:

a = [13, 77, 26, 44, 27, 35, 99, 27, 91, 18, 23, 48]
def compare (i, j):
    return a[i] % 10 < a[j] % 10
def swap (i, j):
    a[i], a[j] = a[j], a[i]
n = len (a)

In this example, compare compares the last digits of the numbers in a.

Now, in the question, we don't have direct access to a. Instead, we have to use the two functions.

Nevertheless, we will want to use a builtin Python sorting function. What can we sort if a is not accessible? We can create a permutation p: a list of positions in a. And then sort it according to what would be the right order in a. The call to sort will make O(n log n) calls to compare.

from functools import cmp_to_key
cmp = lambda i, j: 0 if i == j else -1 if compare (i, j) else +1
p = list (range (n))
p.sort (key = cmp_to_key (cmp))
print (p)  # [8, 0, 10, 3, 5, 2, 1, 4, 7, 9, 11, 6]

The result [8, 0, 10, ..., 11, 6] means the following:
"The right order is a[8], then a[0], then a[10], ..., then a[11], then a[6]".

To learn more about sorting with a custom comparator, see here.
To learn how cmp_to_key works, see there.

Permutation p says "p[i] is where to get the i-th element of the answer". For convenience, let us construct the inverse permutation q instead. Permutation q will say "q[i] is where to put the i-th element of the input".

q = [0] * n
for i in range (n):
    q[p[i]] = i
print (q)  # [1, 6, 5, 3, 7, 4, 11, 8, 0, 9, 2, 10]

The result [1, 6, 5, ..., 2, 10] means the following:
a[0] goes into position 1;
a[1] goes into position 6;
a[2] goes into position 5;
...
a[10] goes into position 2;
a[11] goes into position 10.

Now, we finally move the elements of a using swap.

for i in range (n):
    while q[i] != i:
        j = q[i]
        swap (i, j)
        q[i], q[j] = q[j], q[i]
print (a)  # [91, 13, 23, 44, 35, 26, 77, 27, 27, 18, 48, 99]

This says: while the element at position i has to go somewhere else, let us put it there, at position j = q[i]. As we can only swap pairs of elements, the element which was at position j simultaneously goes to position i. Now q[j] is j: this element went to its final place. And q[i] now is the former value of q[j]: this element may still have to go somewhere.

Each operation in the body of the while loop puts one element to its final place, and we don't move it after that. Since there are n elements in total, we will make no more than n calls to swap.

In total, we made O(n log n) calls to compare and O(n) calls to swap.
Complete code is below. Your program will not have the first part with a, compare, and swap, because they are defined externally.

a = [13, 77, 26, 44, 27, 35, 99, 27, 91, 18, 23, 48]
def compare (i, j):
    return a[i] % 10 < a[j] % 10
def swap (i, j):
    a[i], a[j] = a[j], a[i]
n = len (a)

from functools import cmp_to_key
cmp = lambda i, j: 0 if i == j else -1 if compare (i, j) else +1
p = list (range (n))
p.sort (key = cmp_to_key (cmp))
print (p)  # [8, 0, 10, 3, 5, 2, 1, 4, 7, 9, 11, 6]

q = [0] * n
for i in range (n):
    q[p[i]] = i
print (q)  # [1, 6, 5, 3, 7, 4, 11, 8, 0, 9, 2, 10]

for i in range (n):
    while q[i] != i:
        j = q[i]
        swap (i, j)
        q[i], q[j] = q[j], q[i]
print (a)  # [91, 13, 23, 44, 35, 26, 77, 27, 27, 18, 48, 99]

Neat. But why the `% 10`? That just looks distracting/wrong. — Kelly Bundy, Apr 28 '23 at 11:34
@KellyBundy The `% 10` is to emphasize that the comparison can be any custom thing, and the algorithm will still work. Hmm, now that I think of it, it's not strictly necessary... — Gassa, Apr 28 '23 at 22:54
Yeah, I don't think it's necessary, and rather disagrees with the problem description's "whether x value < y value or not", which looks like using the elements' natural comparison. If you want to make it clearer that the solution doesn't use any properties of the elements, maybe really hide the list from the solution, so it only has access to compare/swap/n. [Demo](https://dpaste.org/LDW7s). — Kelly Bundy, Apr 28 '23 at 23:12

L.Grozinger · Answer 2 · 2023-04-28T09:03:39.990

Optimising your sorting algorithm

You seem to have implemented some kind of variation of the bubble sort. A more usual implementation would be something like:

def bubblesort(A):
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(A)-1):
            if A[i] > A[i+1]:
                A[i], A[i+1] = A[i+1], A[i]
                swapped = True
    return A

>>> A = [1, 6, 3, 7, 7, 2, 3, 64, 7, 8, 4, 72, 724, 46]
>>> bubblesort(A)
[1, 2, 3, 3, 4, 6, 7, 7, 7, 8, 46, 64, 72, 724]

Note that this is already an improvement on your algorithm, since yours always does the n**2 comparisons, where n=len(A), whereas the bubble sort given above will only do this many comparisons in the worst case, because it does n - 1 comparisons each pass, and exits early if it doesn't do a swap on a pass (which happens only if the list is sorted).

This is a O(n**2) average complexity algorithm. So the good news is, this is probably the worst (sensible) algorithm for sorting you could have written. Meaning, almost anything you do next will be an improvement!

Remaining with bubble sort, one simple thing you can do is to notice that all the elements that come after the last swap you did must already be in order, so you can skip those on the next go round:

def bubblesort(A):
    n = len(A)
    while n > 0:
        m = 0
        for i in range(n-1):
            if A[i] > A[i+1]:
                A[i], A[i+1] = A[i+1], A[i]
                m = i + 1
        n = m
    return A

Here you mark where the last swap was made every iteration with variable n, and only check up to n the next time round.

Much faster algorithms exist

But its still very slow compared to other sorting algorithms. You can try to implement others like merge sort or quick sort, which for larger lists will turn out much quicker.

Great, thank you very much for your answer. I have tested it and it indeed works better. With all I could read, I guess the merge sort and quick sort implementations would be the best solutions. Could I ask you how to implement my snippet with Quick sort algorithm ? — Julien, Apr 28 '23 at 09:39
just as an exercise I measured the performance of improved solution to @Julien's as I proposed and your bubblesort on a randomly generated list of 10000 integers between 1 - 1000: improved nested for loops did it in slightly over 3 s and bubblesort in slightly under 9 s performing almost twice as many iterations — adam, Apr 28 '23 at 10:04
@adam perhaps yes, although the worst case time complexity of bubble sort and your loops isn't any different. — L.Grozinger, Apr 28 '23 at 12:11
Anyway, the point is not to use bubble sort. Bubble sort is a bad algorithm. Any variation on the nested looping thing is even worse, in my opinion, because it always performs as the worst case, and never saves itself any work — L.Grozinger, Apr 28 '23 at 12:14

score 0 · Answer 3 · answered Apr 28 '23 at 09:12

You have to learn about sorting algorithms, see inspiration here

Your double nested for loop is of time complexity O(n^2) which is pretty bad. If you adapt it so that the inner loop starts at first index + 1, the time complexity does not change, but the performance will be much better. But still there are much better sorting algorithms than that.

import random
a = [random.randint(1, 100) for _ in range(100)]

for i in range(len(a)):
    for j in range(i+1, len(a)):
        if a[j] < a[i]:
            a[i], a[j] = a[j], a[i]

I measured the improvement on an array of 2000 randomly generated items and it was around twice as fast as the original loop.

Thanks for the link, I'll read that carefully – Julien Apr 28 '23 at 09:37 — Julien, Apr 28 '23 at 09:37

DarkKnight · Answer 4 · 2023-04-28T09:40:43.163

Presumably doing a traditional less-than or greater-than comparison of elements in your list is not appropriate - that's the only reason why you would need to implement your compare() function.

The implementation of compare() in this answer is functionally no different to a direct element comparison but serves to show how you could do this.

We can incorporate the custom function into a Quicksort implementation as follows:

def compare(x, y):
    return 1 if x > y else -1 if x < y else 0

def quicksort(A):
    def _partition(A, lo, hi):
        pivot = A[(lo+hi)//2]
        while True:
            while compare(A[lo], pivot) < 0:
                lo += 1
            while compare(A[hi], pivot) > 0:
                hi -= 1
            if lo >= hi:
                return hi
            A[lo], A[hi] = A[hi], A[lo]
            lo += 1
            hi -= 1
    def _quicksort(A, lo, hi):
        if lo < hi:
            p = _partition(A, lo, hi)
            _quicksort(A, lo, p)
            _quicksort(A, p + 1, hi)
        return A
    return _quicksort(A, 0, len(A)-1)

Thank you for your answer. The quicksort could be very useful, I will give a try, thanks again — Julien, Apr 28 '23 at 09:36

How to optimize the sorting of a big table with Python

4 Answers4