Sorting a list of lists by item frequency in Python 2.3

Question

I have a list, with sub-lists of items like this.

mylist = [
['ITEM A', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO'],
['ITEM B', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'MAYBE'],
['ITEM C', 'YES', 'YES', 'YES', 'YES', 'NO', 'NO', 'MAYBE', 'NO', 'MAYBE']
]

Now I want to sort the sub lists on this condition — the more each row (i.e. sub list) has the items 'YES' and 'MAYBE', the higher it moves up. The more 'NO's each row has, the lower it moves in the sorting list.

The ideal result would be —

mylist = [
['ITEM C', 'YES', 'YES', 'YES', 'YES', 'NO', 'NO', 'MAYBE', 'NO', 'MAYBE'],
['ITEM B', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'MAYBE'],
['ITEM A', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO']
]
#Item C has 4 'YES' and 2 'MAYBE'
#Item B has 3 'YES' and 1 'MAYBE'
#Item C has 3 'YES'

Sadly, I'm stuck on Python 2.3, and need to find out the most efficient way to do this.

@Nima sometimes people as stuck with older versions at work (and lack of / resistance to upgrades by the people making decisions). I worked at a place like that, so OP may not have a choice. — Levon, Jun 18 '12 at 13:58
@Levon Well said. I'm working on a legacy system, developing in 2.3 currently. It blows, but it's necessary. — , Jul 16 '18 at 16:54

senderle · Answer 1 · 2012-11-28T02:29:26.990

To sort by a key in Python 2.3 or lower, you could use the cmp parameter. But sometimes key style sorting is easier to read; and in any case, it does less work, since cmp will be called O(n log n) times, while the key function will be called only O(n) times.

With that in mind, here's a way to reproduce the behavior of the key parameter in later versions of Python. It uses the decorate-sort-undecorate idiom, a.k.a. the Schwartzian Transform. This won't be quite as space efficient because it makes copies, but for large lists, it is likely to be more time efficient. I've named this sorted because it roughly reproduces the sorted function added in 2.4; check the python version and conditionally import this so that you don't smash the built-in sorted in newer versions -- or just rename it.

def sorted(seq, key=lambda x: None, reverse=False):
    seq = [(key(x), i, x) for i, x in enumerate(seq)]
    seq.sort()
    if reverse:
        seq.reverse()
    return [x for k, i, x in seq]

Note that enumerate is only necessary if you care about having a stable sort on unequal values with equal keys; it slows down the function by a hair. Tested on your data:

>>> key=lambda x: (x.count('YES'), x.count('MAYBE'), x.count('NO'))
>>> my_sorted(mylist, key=key, reverse=True)
[['ITEM C', 'YES', 'YES', 'YES', 'YES', 'NO', 'NO', 'MAYBE', 'NO', 'MAYBE'], 
 ['ITEM B', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'MAYBE'], 
 ['ITEM A', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO']]

You might also consider using a dictionary to do your counting; that way, only one pass is required. However, count is sufficiently optimized that three passes are still faster than one Python for loop, at least on my machine. So only use this if you need to count lots of values. I'll leave this here for posterity:

def my_key(inner_list):
    counts = {'YES':0, 'MAYBE':0, 'NO':0}
    for i in inner_list:
        if i in counts:
            counts[i] += 1
    return (counts['YES'], counts['MAYBE'], counts['NO'])

I did some testing; apologies for the long post. The below is only for the curious and inquisitive.

My tests indicate that on the smaller list, decorate, sort, undecorate is already faster than using the built-in sort + cmp. On a bigger list, the difference becomes more dramatic. Definitions:

def key_count(x):
    return (x.count('YES'), x.count('MAYBE'), x.count('NO'))

def key_dict(inner_list):
    counts = {'YES':0, 'MAYBE':0, 'NO':0}
    for i in inner_list:
        if i in counts:
            counts[i] += 1
    return (counts['YES'], counts['MAYBE'], counts['NO'])

def decorate_sort(seq, key=lambda x: None, reverse=False):
    seq = [(key(x), i, x) for i, x in enumerate(seq)]
    seq.sort()
    if reverse:
        seq.reverse()
    return [x for k, i, x in seq]

def builtin_sort(seq, key, reverse=False):
    seq.sort(lambda p, q: cmp(key(p), key(q)))
    if reverse:
        seq.reverse()

Tests:

>>> mylist = [
... ['ITEM A', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO'],
... ['ITEM B', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'MAYBE'],
... ['ITEM C', 'YES', 'YES', 'YES', 'YES', 'NO', 'NO', 'MAYBE', 'NO', 'MAYBE']
... ]
>>> %timeit decorate_sort(mylist, key=key_count, reverse=True)
100000 loops, best of 3: 5.03 us per loop
>>> %timeit builtin_sort(mylist, key=key_count, reverse=True)
100000 loops, best of 3: 5.28 us per loop

The built-in version is already slower! The less-generalized version mylist.sort(lambda p, q: -cmp(key(p), key(q))) is a better by a hair for a short list, because of the addition of enumerate to decorate_sort; without it, decorate_sort is faster (4.28 us per loop in my previous test):

>>> %timeit mylist.sort(lambda p, q: -cmp(key_count(p), key_count(q)))
100000 loops, best of 3: 4.74 us per loop

Using key_dict is a mistake in this case, though:

>>> %timeit decorate_sort(mylist, key=key_dict, reverse=True)
100000 loops, best of 3: 8.97 us per loop
>>> %timeit builtin_sort(mylist, key=key_dict, reverse=True)
100000 loops, best of 3: 11.4 us per loop

Testing it on a larger list, basically the same results hold:

>>> import random
>>> mylist = [[random.choice(('YES', 'MAYBE', 'NO')) for _ in range(1000)] 
              for _ in range(100)]
>>> %timeit decorate_sort(mylist, key=key_count, reverse=True)
100 loops, best of 3: 6.93 ms per loop
>>> %timeit builtin_sort(mylist, key=key_count, reverse=True)
10 loops, best of 3: 34.5 ms per loop

The less generalized version is now slower than decorate_sort.

>>> %timeit mylist.sort(lambda p, q: -cmp(key_count(p), key_count(q)))
100 loops, best of 3: 13.5 ms per loop

And key_dict is still slower. (But faster than builtin_sort!)

>>> %timeit decorate_sort(mylist, key=key_dict, reverse=True)
10 loops, best of 3: 20.4 ms per loop
>>> %timeit builtin_sort(mylist, key=key_dict, reverse=True)
10 loops, best of 3: 103 ms per loop

So the upshot is that the Schwartzian Transform provides a solution that is both faster and more generalized -- a rare and wonderful combination.

Thanks for the fantastically detailed analysis. I will check all of these. Thanks a lot. — GPX, Jun 19 '12 at 08:11
Unfortunately, your Schwartzian transform can break the stable sort guarantee (if `key(x)` and `key(y)` compare equal but `x` and `y` compare unequal). An easy fix is to store the index in the transformed tuple: `seq = [(key(x), i, x) for i, x in enumerate(seq)]` — ecatmur, Jun 19 '12 at 08:46
Also, it doesn't allow providing *both* a `key` and `cmp` - if you'd ever want to! I've extended your solution with both fixes in my answer. — ecatmur, Jun 19 '12 at 09:04
@ecatmur, yeah, I kind of left `cmp` out on purpose. If you want to use `cmp`, just use the built-in version, which will be faster than anything coded in Python. Also, I've seen people use both `cmp` and `key` before, and I thought it was a bad idea. I prefer the Python 3 way of doing things in this case. Still, thanks for reminding me about sort stability. — senderle, Jun 19 '12 at 13:48

ecatmur · Accepted Answer · 2012-06-19T09:02:35.657

2

General solution: use list.sort with a key function returning a tuple:

mylist.sort(key=lambda sl: (sl.count('YES') + sl.count('MAYBE'), -sl.count('NO')), reverse=True)

key and reverse were added in Python 2.4, so you'll have to do it manually:

key = lambda sl: (sl.count('YES') + sl.count('MAYBE'), -sl.count('NO'))
mylist.sort(lambda p, q: -cmp(key(p), key(q)))

If key is slow it's best to use a solution that computes the key function on each item only once (a so-called "Schwartzian transform"). Note that >=Python 2.4 perform this optimisation (or similar) already:

def key_sort(seq, cmp=None, key=None, reverse=False):
    if key is not None:
        transform = [(key(x), i, x) for i, x in enumerate(seq)]
        transform.sort(None if cmp is None else lambda (k, _, _), (l, _, _): cmp(k, l))
        seq[:] = [x for _, _, x in transform]
    else:
        seq.sort(cmp)
    if reverse:
        seq.reverse()

edited Jun 19 '12 at 09:02

answered Jun 18 '12 at 13:39

ecatmur

152,476
27
293
366

+1, but the OP may need to play around with `key`. The main idea is that it should return a tuple. I'd suggest something like `(sl.count('YES'), sl.count('MAYBE'), -sl.count('NO'))` in case `'YES'` is any better than `'MAYBE'`. – Lev Levitsky Jun 18 '12 at 13:42
1

I think sorting [works a bit differently](http://docs.python.org/release/2.3/lib/typesseq-mutable.html) in python2.3 – mata Jun 18 '12 at 13:44
Add some weights for the YES and MAYBE. Say, (100, 10, -100) for (YES, MAYBE, NO) – Ribtoks Jun 18 '12 at 13:50
@Ribtoks that works if 10 MAYBE count as 1 YES; the tuple approach is necessary if successes and failures are to be considered separately. – ecatmur Jun 18 '12 at 13:56

Sorting a list of lists by item frequency in Python 2.3

2 Answers2

Linked