11

I have lists a,b,c,... of equal length. I'd like to sort all of them the order obtained by sorting a, i.e., I could do the decorate-sort-undecorate pattern

a, b, c = map(list, zip(*sorted(zip(a, b, c))))

or something like that. However, I'd like that the lists are sorted in place (I assume that sorted pulls everything from the temporary iterator passed to it to a temporary list, and then zip stuff into three output lists, so every datum in the input is copied twice unnecessarily) without creating temporary objects. So what I don't mean is:

a_sorted, b_sorted, c_sorted = map(list, zip(*sorted(zip(a, b, c))))
a[:] = a_sorted
b[:] = b_sorted
c[:] = c_sorted

How can I achieve that?

Emi OB
  • 2,814
  • 3
  • 13
  • 29
Bubaya
  • 615
  • 3
  • 13
  • 1
    Without any temporary objects (except O(1) memory)? That's not possible. That would require some dedicated "sort together" function. You can do it with 2*O(n) memory overhead though, if that's acceptable. However if your data was already stored as a list of lists, instead of many distinct lists, then you could sort the whole thing in-place according to one of the "columns". So perhaps it's best to rethink the overall data structure, if possible. – a_guest Dec 02 '21 at 16:01
  • @a_guest Yes, basically, that hypothetical `sort_together` is what I'm after. – Bubaya Dec 02 '21 at 16:31
  • What I meant is that such a function would need to be provided by the language / implementation, but since it's not there, you can't use it. So in that case, your only option is to fork CPython and implement it yourself, then use the fork :-) – a_guest Dec 02 '21 at 16:33
  • 1
    There is [`more_itertools.sort_together`](https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.sort_together), but looking at [its source-code](https://more-itertools.readthedocs.io/en/stable/_modules/more_itertools/more.html#sort_together), it's exactly the `zip(*sorted(zip(*iterables)))` that you want to avoid (although it's for iterables in general, not just lists - so the in-place concept wouldn't be applicable anyway). – Stef Dec 02 '21 at 17:17
  • [Similar question about numpy arrays](https://stackoverflow.com/questions/1903462/how-can-i-zip-sort-parallel-numpy-arrays) – Stef Dec 02 '21 at 17:18
  • I've got a bunch of sort algorithms implemented in Python and Cython (your choice) at https://stromberg.dnsalias.org/~strombrg/sort-comparison/ . One of them is the awesome timsort. These could easily be modified to do 3 swaps instead of one. – dstromberg Dec 04 '21 at 18:59
  • @dstromberg Your [shuffle](https://stromberg.dnsalias.org/svn/sorts/compare/trunk/performance-tests) function is [biased](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Na%C3%AFve_method), is that on purpose? – Kelly Bundy Dec 04 '21 at 19:35
  • @KellyBundy Would you believe that was how I was taught to shuffle an array in school? Anyway, I'm changing it to use random.shuffle() as I compose this message. – dstromberg Dec 05 '21 at 21:52

2 Answers2

11

I think "without creating temporary objects" is impossible, especially since "everything is an object" in Python.

You could get O(1) space / number of objects if you implement some sorting algorithm yourself, though if you want O(n log n) time and stability, it's difficult. If you don't care about stability (seems likely, since you say you want to sort by a but then actually sort by a, b and c), heapsort is reasonably easy:

def sort_together_heapsort(a, b, c):
    n = len(a)
    def swap(i, j):
        a[i], a[j] = a[j], a[i]
        b[i], b[j] = b[j], b[i]
        c[i], c[j] = c[j], c[i]
    def siftdown(i):
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                return
            swap(i, imax)
            i = imax
    for i in range(n // 2)[::-1]:
        siftdown(i)
    while n := n - 1:
        swap(0, n)
        siftdown(0)

Anyway, if someone's interested in just saving some amount of memory, that can be done by decorating in-place (building tuples and storing them in a):

def sort_together_decorate_in_a(a, b, c):
    for i, a[i] in enumerate(zip(a, b, c)):
        pass
    a.sort()
    for i, [a[i], b[i], c[i]] in enumerate(a):
        pass

Or if you trust that list.sort will ask for keys for the elements in order (at least in CPython it does, already did so when the key parameter was introduced 18 years ago, and I suspect will keep doing so):

def sort_together_iter_key(a, b, c):
    it = iter(a)
    b.sort(key=lambda _: next(it))
    it = iter(a)
    c.sort(key=lambda _: next(it))
    a.sort()

Testing memory and time with three lists of 100,000 elements:

15,072,520 bytes   152 ms  sort_together_sorted_zip
15,072,320 bytes   166 ms  sort_together_sorted_zip_2
14,272,576 bytes   152 ms  sort_together_sorted_zip_X
 6,670,708 bytes   126 ms  sort_together_decorate_in_a
 6,670,772 bytes   177 ms  sort_together_decorate_in_first_X
 5,190,212 bytes   342 ms  sort_multi_by_a_guest_X
 1,597,400 bytes   100 ms  sort_together_iter_key
 1,597,448 bytes   102 ms  sort_together_iter_key_X
       744 bytes  1584 ms  sort_together_heapsort
       704 bytes  1663 ms  sort_together_heapsort_X
       168 bytes  1326 ms  sort_together_heapsort_opti
       188 bytes  1512 ms  sort_together_heapsort_opti_X

Note:

  • The second solution is a shortened/improved version of yours, no need for temporary variables and conversions to lists.
  • The solutions with _X suffix are versions that take arbitrarily many lists as parameters.
  • The @a_guest is from their answer. Runtime-wise it currently benefits from my data being random, as that doesn't expose that solution's worst case complexity O(m * n²), where m is the number of lists and n is the length of each list.

Testing memory and time with ten lists of 100,000 elements:

19,760,808 bytes   388 ms  sort_together_sorted_zip_X
12,159,100 bytes   425 ms  sort_together_decorate_in_first_X
 5,190,292 bytes  1249 ms  sort_multi_by_a_guest_X
 1,597,528 bytes   393 ms  sort_together_iter_key_X
       704 bytes  4186 ms  sort_together_heapsort_X
       188 bytes  4032 ms  sort_together_heapsort_opti_X

The whole code (Try it online!):

import tracemalloc as tm
from random import random
from timeit import timeit

def sort_together_sorted_zip(a, b, c):
    a_sorted, b_sorted, c_sorted = map(list, zip(*sorted(zip(a, b, c))))
    a[:] = a_sorted
    b[:] = b_sorted
    c[:] = c_sorted

def sort_together_sorted_zip_2(a, b, c):
    a[:], b[:], c[:] = zip(*sorted(zip(a, b, c)))

def sort_together_sorted_zip_X(*lists):
    sorteds = zip(*sorted(zip(*lists)))
    for lst, lst[:] in zip(lists, sorteds):
        pass

def sort_together_decorate_in_a(a, b, c):
    for i, a[i] in enumerate(zip(a, b, c)):
        pass
    a.sort()
    for i, [a[i], b[i], c[i]] in enumerate(a):
        pass

def sort_together_decorate_in_first_X(*lists):
    first = lists[0]
    for i, first[i] in enumerate(zip(*lists)):
        pass
    first.sort()
    for i, values in enumerate(first):
        for lst, lst[i] in zip(lists, values):
            pass

def sort_together_iter_key(a, b, c):
    it = iter(a)
    b.sort(key=lambda _: next(it))
    it = iter(a)
    c.sort(key=lambda _: next(it))
    a.sort()

def sort_together_iter_key_X(*lists):
    for lst in lists[1:]:
        it = iter(lists[0])
        lst.sort(key=lambda _: next(it))
    lists[0].sort()

def sort_together_heapsort(a, b, c):
    n = len(a)
    def swap(i, j):
        a[i], a[j] = a[j], a[i]
        b[i], b[j] = b[j], b[i]
        c[i], c[j] = c[j], c[i]
    def siftdown(i):
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                return
            swap(i, imax)
            i = imax
    for i in range(n // 2)[::-1]:
        siftdown(i)
    while n := n - 1:
        swap(0, n)
        siftdown(0)

def sort_together_heapsort_X(*lists):
    a = lists[0]
    n = len(a)
    def swap(i, j):
        for lst in lists:
            lst[i], lst[j] = lst[j], lst[i]
    def siftdown(i):
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                return
            swap(i, imax)
            i = imax
    for i in range(n // 2)[::-1]:
        siftdown(i)
    while n := n - 1:
        swap(0, n)
        siftdown(0)

def sort_together_heapsort_opti(a, b, c):
    # Avoid inner functions and range-loop to minimize memory.
    # Makes it faster, too. But duplicates code. Not recommended.
    n = len(a)
    i0 = n // 2 - 1
    while i0 >= 0:
        i = i0
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                break
            a[i], a[imax] = a[imax], a[i]
            b[i], b[imax] = b[imax], b[i]
            c[i], c[imax] = c[imax], c[i]
            i = imax
        i0 -= 1
    while n := n - 1:
        a[0], a[n] = a[n], a[0]
        b[0], b[n] = b[n], b[0]
        c[0], c[n] = c[n], c[0]
        i = 0
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                break
            a[i], a[imax] = a[imax], a[i]
            b[i], b[imax] = b[imax], b[i]
            c[i], c[imax] = c[imax], c[i]
            i = imax

def sort_together_heapsort_opti_X(*lists):
    # Avoid inner functions and range-loop to minimize memory.
    # Makes it faster, too. But duplicates code. Not recommended.
    a = lists[0]
    n = len(a)
    i0 = n // 2 - 1
    while i0 >= 0:
        i = i0
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                break
            for lst in lists:
                lst[i], lst[imax] = lst[imax], lst[i]
            i = imax
        i0 -= 1
    while n := n - 1:
        for lst in lists:
            lst[0], lst[n] = lst[n], lst[0]
        i = 0
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                break
            for lst in lists:
                lst[i], lst[imax] = lst[imax], lst[i]
            i = imax

def sort_multi_by_a_guest_X(a, *lists):
    indices = list(range(len(a)))
    indices.sort(key=lambda i: a[i])
    a.sort()
    for lst in lists:
        for i, j in enumerate(indices):
            while j < i:
                j = indices[j]
            lst[i], lst[j] = lst[j], lst[i]

funcs = [
    sort_together_sorted_zip,
    sort_together_sorted_zip_2,
    sort_together_sorted_zip_X,
    sort_together_decorate_in_a,
    sort_together_decorate_in_first_X,
    sort_multi_by_a_guest_X,
    sort_together_iter_key,
    sort_together_iter_key_X,
    sort_together_heapsort,
    sort_together_heapsort_X,
    sort_together_heapsort_opti,
    sort_together_heapsort_opti_X,
]

n = 100000
a0 = [random() for _ in range(n)]
b0 = [x + 1 for x in a0]
c0 = [x + 2 for x in a0]

for _ in range(3):
    for func in funcs:

        a, b, c = a0[:], b0[:], c0[:]
        time = timeit(lambda: func(a, b, c), number=1)
        assert a == sorted(a0)
        assert b == sorted(b0)
        assert c == sorted(c0)

        a, b, c = a0[:], b0[:], c0[:]
        tm.start()
        func(a, b, c)
        memory = tm.get_traced_memory()[1] 
        tm.stop()

        print(f'{memory:10,} bytes  {int(time * 1e3):4} ms  {func.__name__}')
    print()
Kelly Bundy
  • 23,480
  • 7
  • 29
  • 65
  • 1
    I have difficulties understanding how your `sort_together` works, i.p. what `for i, a[i] in ...` does. Is it reassigning `a[i] = (a[i], b[i], c[i])` as a side effect? Looks like that. Please correct me if I'm wrong. – Bubaya Dec 03 '21 at 09:24
  • 1
    @Bubaya Yes, that's what it does, though I wouldn't call it a "side" effect. – Kelly Bundy Dec 03 '21 at 09:27
  • I had never seen this type of loop. Excellent idea! – Guimoute Dec 03 '21 at 09:29
  • "if you trust that `list.sort` will ask for keys for the elements in order" is that justified? Rather would like to have knowledge rather than trust ;). – Bubaya Dec 03 '21 at 09:30
  • @Bubaya It's not guaranteed, but it's "justified" in the sense that I mentioned there now. – Kelly Bundy Dec 03 '21 at 09:40
  • `sort_together_2` uses O(n*N) memory where `n` is the number of lists, so it doesn't use constant memory overhead (also, it's not stable w.r.t. `a` so you'd need to use a `key` function). `sort_together_3` relies on implementation details which are not part of the language spec, this can cause problems later on, also when porting to different implementations; I wouldn't use it. Also, `for i, a[i] in enumerate(zip(a, b, c)): pass` is an obscure way of writing `for i, values in enumerate(zip(a, b, c)): a[i] = values`. The former looks like a loop that does nothing or yet needs to be implemented. – a_guest Dec 03 '21 at 14:57
  • That's a great answer! – Timus Dec 04 '21 at 08:22
  • @KellyBundy I'm very willing to read your code. You get to post your code as an answer and I get to critique it in form of a comment, that's how the site works. But don't take this the wrong way, it's to help you improve your answer. Speaking of improvement, since the OP indicates that they're dealing with a variable number of lists, not necessarily three, and also to be relevant to other readers of this question, it would be good if your `sort_together_X` functions accepted a variable number of lists instead of exactly three (i.e. something like `def sort_together_X(*lists):`). – a_guest Dec 04 '21 at 19:24
  • @a_guest Ah yes, I overlooked their ",..". Other than that, the question really makes it look like exactly three. I'll update. – Kelly Bundy Dec 04 '21 at 19:50
1

The following function uses a memory overhead that is independent of the number of lists to sort. It is stable w.r.t. the first list.

def sort_multi(a, *lists):
    indices = list(range(len(a)))
    indices.sort(key=lambda i: a[i])
    a.sort()
    for lst in lists:
        for i, j in enumerate(indices):
            while j < i:
                j = indices[j]
            lst[i], lst[j] = lst[j], lst[i]
a_guest
  • 34,165
  • 12
  • 64
  • 118
  • I think Iconsidered that idea, not sure why I didn't implement it. Maybe because I thought the question is about exactly three lists, and then all those index objects and the extra list cost almost as much. – Kelly Bundy Dec 04 '21 at 20:55
  • Anyway, while this looks very elegant, it's only O(m * n^2) where m is the number of lists and n is the length of each list. – Kelly Bundy Dec 04 '21 at 20:58
  • (I believe the fast version will look almost as elegant, though) – Kelly Bundy Dec 04 '21 at 21:20
  • @KellyBundy Yes, it's not very efficient and honestly I cannot imagine a situation where I would want to sacrifice that performance for reduced memory consumption. – a_guest Dec 04 '21 at 22:19
  • Using `def sorted_multi(*lists)` and `indices = sorted(range(len(lists[0])), key=lists[0].__getitem__), lets you skip sorting `a` separately. I believe it is O(n log n) for the sort and O(mn) to copy the lists to the proper order (consider just copying each list to a temp and then copying it back in order). Unless the lists are very short or there are very many lists, the O(n log n) would dominate. – RootTwo Dec 06 '21 at 06:32
  • 1
    @RootTwo I'm using `a.sort()` on purpose, as it is likely to be more efficient than the swapping with indices. Also, the signature emphasizes that the sorting is performed w.r.t. the first list `a`, not all lists together. – a_guest Dec 06 '21 at 08:47