The goal is to find the last `n` keys with the least counts.
Given this goal's definition, neither of your two solutions fits. In the one with `Counter` you use `dict`, which makes the order of keys undefined, so you will not get the *last* keys, but *some* `n` keys with the least values. The second solution has incorrect slicing, and once that's fixed it returns the *first* `n` keys with the least values.
Taking into account that the `sorted` implementation is stable, it can be rewritten like this to fit the goal:
    def author_2():
        output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
        return list(reversed(output))
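To see why stability makes this work, here's a minimal sketch with made-up data (not the author's list): reversing the input first means later items win ties in the stable sort, and a final reverse restores the original order.

```python
# Made-up data: three keys tie at count 1; the *last* two of them are 'c', 'd'.
data = [('a', 1), ('b', 2), ('c', 1), ('d', 1)]
n = 2

# Reverse first so that, among equal counts, later items sort earlier.
output, *_ = zip(*sorted(reversed(data), key=lambda v: v[1])[:n])
result = list(reversed(output))  # -> ['c', 'd']
```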
But it's a better idea to use `heapq`, which is the stdlib tool for questions like "n smallest/greatest values from an iterable" (as Martijn Pieters pointed out, `nlargest` and `nsmallest` are also stable, and the docs do say that, though implicitly). This is especially true if the real list you have to deal with is big (for small `n` it should be faster than `sorted`, as the docs describe).
    def prop_1():
        rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
        return [item[0] for item in rev_result][::-1]
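The stability of `nsmallest` is easy to check on a toy input: among items with equal keys, input order is preserved.

```python
import heapq

# 'a' and 'b' tie at 1; the stable nsmallest keeps 'a' (earlier in input) first.
items = [('a', 1), ('b', 1), ('c', 0)]
smallest = heapq.nsmallest(2, items, key=lambda v: v[1])  # -> [('c', 0), ('a', 1)]
```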
You can improve performance even further, but at the cost of order (sorting stability), i.e. you get *some* `n` keys with the least values instead of the *last* `n` keys with the least values. To do that you need to keep a "heapified" list and use it as your internal data structure instead of a plain `list` (if you don't change the list and need the bottom-n only once, it will not give a performance benefit). You can push and pop from the heap, for example:
    _p2_heap = None

    def prop_2():
        global _p2_heap
        if not _p2_heap:
            _p2_heap = []
            for item in l:
                heapq.heappush(_p2_heap, item[::-1])
        return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]
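Keeping the heap around pays off when the data changes between queries: pushes and pops are O(log n), so you never re-sort the whole list. A small sketch (the key names here are just sample data):

```python
import heapq

heap = []
for key, count in [('o', 38), ('herr', 1), ('poh', 6)]:
    heapq.heappush(heap, (count, key))  # (count, key) so the heap orders by count

lowest = heapq.heappop(heap)           # pair with the smallest count: (1, 'herr')
heapq.heappush(heap, (2, 'tiao'))      # new data can be added cheaply
bottom_two = heapq.nsmallest(2, heap)  # current bottom-2 without a full sort
```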
Here's the complete module you can use to benchmark the solutions:
    import heapq
    from collections import Counter

    l = [
        ('herr', 1), ('dapao', 1),
        ('cino', 1), ('o', 38),
        ('tiao', 2), ('tut', 1),
        ('poh', 6), ('micheal', 1),
        ('orh', 1), ('horlick', 3),
        ('si', 1), ('tai', 1),
        ('titlo', 1), ('siew', 17),
        ('da', 1), ('halia', 2)
    ]
    n = 5

    def author_1():
        return list(dict(Counter(dict(l)).most_common()[-n:]).keys())

    def author_2():
        output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
        return list(reversed(output))

    def prop_1():
        rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
        return [item[0] for item in rev_result][::-1]

    _p2_heap = None

    def prop_2():
        global _p2_heap
        if not _p2_heap:
            _p2_heap = []
            for item in l:
                heapq.heappush(_p2_heap, item[::-1])
        return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]
And here are the `timeit` results:
    $ python -m timeit -s "import tst" "tst.author_1()"
    100000 loops, best of 3: 7.72 usec per loop
    $ python -m timeit -s "import tst" "tst.author_2()"
    100000 loops, best of 3: 3.7 usec per loop
    $ python -m timeit -s "import tst" "tst.prop_1()"
    100000 loops, best of 3: 5.51 usec per loop
    $ python -m timeit -s "import tst" "tst.prop_2()"
    100000 loops, best of 3: 3.96 usec per loop
But if we make `l = l * 1000`, the difference becomes noticeable:
    $ python -m timeit -s "import tst" "tst.author_1()"
    1000 loops, best of 3: 263 usec per loop
    $ python -m timeit -s "import tst" "tst.author_2()"
    100 loops, best of 3: 2.72 msec per loop
    $ python -m timeit -s "import tst" "tst.prop_1()"
    1000 loops, best of 3: 1.65 msec per loop
    $ python -m timeit -s "import tst" "tst.prop_2()"
    1000 loops, best of 3: 767 usec per loop