6

Is there an easy (meaning without rolling one's own sorting function) way to sort parallel lists without unnecessary copying in Python? For example:

foo = range(5)
bar = range(5, 0, -1)
parallelSort(bar, foo)
print foo # [4,3,2,1,0]
print bar # [1,2,3,4,5]

I've seen the examples using zip but it seems silly to copy all your data from parallel lists to a list of tuples and back again if this can be easily avoided.

dsimcha
  • 67,514
  • 53
  • 213
  • 334
  • 1
    What do you think this parallelSort would do? It appears from your comments that it sort foo in decreasing order and bar in increasing order -- is that right? –  Feb 08 '10 at 15:55
  • @Paul: It sorts bar and manipulates foo in lockstep. – dsimcha Feb 08 '10 at 15:58
  • What will `parallelSort` gives if initially `foo` is `[2,4,6,10,8]` and `bar` is `[3,7,9,5,1]`? – kennytm Feb 08 '10 at 15:58
  • @Kenny: bar is now sorted and equals [1,3,5,7,9]. foo is manipulated in lockstep and now equals [8,2,10,4,6]. – dsimcha Feb 08 '10 at 15:59

4 Answers4

6

Here's an easy way:

perm = sorted(xrange(len(foo)), key=lambda x:foo[x])

This generates a list of permutations - the value in perm[i] is the index of the ith smallest value in foo. Then, you can access both lists in order:

for p in perm:
  print "%s: %s" % (foo[p], bar[p])

You'd need to benchmark it to find out if it's any more efficient, though - I doubt it makes much of a difference.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
  • Change `range` to `xrange` if you want to make a difference. Unless you're using Python 3. – Chris Lutz Feb 09 '10 at 07:07
  • Hm, true. Or use .sort instead of sorted, but that ruins the one-liner-ness. ;) – Nick Johnson Feb 09 '10 at 08:49
  • turns out this is no better than sorting them out-of-place, because `sorted` will greedily allocate lots of memory, e.g. `sorted(range(10**6),key=lambda x:x)`. (by range I mean xrange, it's been changed in python3) You will notice a significant chunk of your RAM disappear when you do this. Turns out that `sorted` is smart enough not to sort `range` though, so beware of testing without a `key=` function. – ninjagecko May 23 '11 at 15:35
3

Is there an easy way? Yes. Use zip.

Is there an "easy way that doesn't use a zip variant"? No.

If you wanted to elaborate on why you object to using zip, that would be helpful. Either you're copying objects, in which case Python will copy by reference, or you're copying something so lightweight into a lightweight tuple as to not be worthy of optimization.

If you really don't care about execution speed but are especially concerned for some reason about memory pressure, you could roll your own bubble sort (or your sort algorithm of choice) on your key list which swaps both the key list and the target lists elements when it does a swap. I would call this the opposite of easy, but it would certainly limit your working set.

Jeffrey Harris
  • 3,480
  • 25
  • 30
  • 3
    Just because you can't think of an easy way that doesn't use zip doesn't mean there isn't one - see my answer. :) – Nick Johnson Feb 08 '10 at 18:19
  • Your answer is zipping by another name, so I stand behind "there is no easy way that doesn't use a zip variant". This was a silly question to begin with, though, so if sorting in memory what are essentially tuples of (sort_value, index) is preferable to sorting tuples of (sort_value, target_value), fine. – Jeffrey Harris Feb 09 '10 at 15:57
  • "Zipping by another name"? It's certainly not - it doesn't have anything to do with zipping, and doesn't modify the original elements at all. Indeed, it doesn't even touch the second array. – Nick Johnson Feb 10 '10 at 09:42
  • Zipping just means constructing an iterable of tuples from two or more source iterables. What do you think sorted(indices, key=foo) is doing? It's constructing a list of tuples of (foo_val, index), then sorting them. That is indistinguishable from zipping (foo_val, target_val) and sorting. – Jeffrey Harris Feb 10 '10 at 15:28
0

To achieve this, you would have to implement your own sort.

However: Does the unnecessary copying really hurt your application? Often parts of Python strike me as inefficient, too, but they are efficient enough for what I need.

bayer
  • 6,854
  • 24
  • 35
  • I see your point about avoiding premature optimization, but sometimes (this being one such case) I like to write generic code and know that if I ever use it on a huge dataset or something it will "just work". In this case I'm more worried about running out of memory than about speed. – dsimcha Feb 08 '10 at 16:16
  • and wouldn't *your own sort* involve use of `zip`, `dict`, etc? – SilentGhost Feb 08 '10 at 16:17
  • No. Say you implement your own quicksort - you can make sure to make any swaps on both lists. – bayer Feb 08 '10 at 16:21
  • I would suspect any parallel sort implemented in Python for this would use a lot more time and a little more memory. The big-O speed and space cannot be improved, and you're adding a lot of overhead to do this in pure Python. – Mike Graham Feb 08 '10 at 16:41
  • @dimcha, If you have a large dataset and are running out of memory, the solution may be this, but it may be to use a numpy array, or an `array.array`, or a database, etc. Consider whether this is the solution that will really help. – Mike Graham Feb 08 '10 at 16:43
  • @Mike - Why would it use more memory? – bayer Feb 08 '10 at 16:58
  • @bayer, Because you would have to keep track of what you are doing in Python with several new Python objects, but the sorting implementation used in CPython has been optimized in C using C variables when possible. Like I said, I "suspect" this; I don't have a rigorous proof, let alone a shred of evidence. – Mike Graham Feb 08 '10 at 17:22
  • That additional memory would certainly be O(1) and not O(n). I am totally with you, using the Python sort stuff is smarter. However, if he has very hard memory requirements, this might be a different story... – bayer Feb 08 '10 at 17:24
0

Any solution I can imagine short of introducing a sort from scratch uses indices, or a dict, or something else that really is not apt to save you memory. In any event, using zip will only increase memory usage by a constant factor, so it is worth making sure this is really a problem before a solution.

If it does get to be a problem, there may be more effective solutions. Since the elements of foo and bar are so closely related, are you sure their right representation is not a list of tuples? Are you sure they should not be in a more compact data structure if you are running out of memory, such as a numpy array or a database (the latter of which is really good at this kind of manipulation)?

(Also, incidentally, itertools.izip can save you a little bit of memory over zip, though you still end up with the full zipped list in list form as the result of sorted.)

Mike Graham
  • 73,987
  • 14
  • 101
  • 130