Getting the keys of items with the least counts from a list of tuples of key-value pairs - Python

Question

The input is an unsorted list of tuples:

x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

The goal is to find the last n keys with the least counts, i.e. desired output:

['orh', 'si', 'tai', 'titlo', 'da']

I've tried doing this by:

first convert the list of tuples to a dict
cast the dict into a Counter
then find the [-n:] list of tuples from the Counter.most_common()
cast the list of tuples from the [-n:] to a dict
get the keys and then convert it into a list

i.e.

n = 5
list(dict(Counter(dict(x)).most_common()[-n:]).keys())

Is there a less convoluted way to get the same output?

I could also do this:

from operator import itemgetter
output, *_ = zip(*sorted(x, key=itemgetter(1))[n:])
list(output)

But now I've merely swapped out the Counter.most_common with sorted and itemgetter. Then I would still need to zip(*list) to extract the keys through unpacking the first value from each list of tuples after the zip.

There must be a simpler way.

NOTE

Note that the question is not asking to sort, it's to extract the list first element in the original list of tuples given. And the criterion to extract is based on the last nth items with the lowest value in the 2nd element.

The answers from the possible duplicate linked still requires the step to unpack the list of sorted tuples and and the extract the top nth of the list of first elements.

[Possible duplicate](https://stackoverflow.com/questions/3121979/how-to-sort-list-tuple-of-lists-tuples) — Lomtrur, Feb 12 '18 at 11:19
how those values `'kup', 'gor', 'beer', 'hor', 'jia'` correlate with your input tuples? — RomanPerekhrest, Feb 12 '18 at 11:20
@Lomtrur, after the sort by values, there's still more steps to get the last/first few keys =) — alvas, Feb 12 '18 at 11:25
@Goyo , RomanPerekhrest, pardon the mistake copy+paste. Corrected the output. — alvas, Feb 12 '18 at 11:26
Possible duplicate of [Sort a list of tuples by 2nd item (integer value)](https://stackoverflow.com/questions/10695139/sort-a-list-of-tuples-by-2nd-item-integer-value) — Yuvraj Jaiswal, Feb 12 '18 at 11:35
Note that the question is not asking to sort, it's to extract the list first element in the original list of tuples given. And the criterion to extract is based on the last nth items with the lowest value in the 2nd element. — alvas, Feb 12 '18 at 11:37
The answers from the possible duplicate linked still requires the step to unpack the list of sorted tuples and and the extract the top nth of the list of first elements. — alvas, Feb 12 '18 at 11:38
does ordering of output matter? e.g. are `[a, c, b]` and `[c, b, a]` for "last 3" equivalent. — jpp, Feb 12 '18 at 12:12
Nope, it doesn't matter. Just counts the value, we can let Python handle how things are sorted beyond the value of the 2nd element in the original list of tuple. — alvas, Feb 12 '18 at 12:15
Related: https://stackoverflow.com/questions/33623184/fastest-method-of-getting-k-smallest-numbers-in-unsorted-list-of-size-n-in-pytho — Ilja Everilä, Feb 12 '18 at 12:24
Maybe dumb question: If you had `[('a', 1), ('b', 2), ('c', 3), ('d', 2), ('e', 3)]` and requested `4` elements, what would the desired output be? The `a` tuple (the one minimum value), or the `a, b, d, e` tuples (Four elements, prioritized primarily by minimum value and secondarily by position in the list)? — MutantOctopus, Mar 01 '18 at 01:15
So *which* minimum values are picked is irrelevant? Your question states "the **last** `n` keys", which would suggest that `e` should be prioritized over `c` (since it comes later in the list), and the example output you posted seemed to confirm that. I just want to confirm before I review the answers or suggest my own. — MutantOctopus, Mar 01 '18 at 01:20

saaj · Answer 1 · 2018-03-01T15:26:48.803

5

The goal is to find the last n keys with the least counts

Given this goal's definition neither of your two solutions fit. In one with Counter you use dict which will make the order of keys undefined and you will not get the last keys, but some n keys with least value. The second solution has incorrect slicing, and if it's fixed it returns first n keys with least value.

Taking into account that sorted implementation is stable it can be rewritten like this to fit the goal:

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))

But it's a better idea to use heapq, which is the stdlib tool for questions like "n smallest/greatest values from an iterable" (as Martijn Pieters pointed out, nlargest and nsmallest are also stable and the docs really say that, but in implicit way). Especially if the real list you have to deal with is big (for small n it should be faster that sorted as docs describe).

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

You can improve performance even further, but at the cost of order (sorting stability), i.e. some n keys with least value instead of last n keys with least value. To do that you need to keep a "heapified" list and use it as your internal data structure instead of plain list (if you don't change the list and need the bottom-n only once, it will not give a performance benefit). You can push and pop from the list, for example:

_p2_heap = None

def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])

    return [item[1] for item in heapq.nsmallest(n, _p2_heap)]

Here's the complete module you can use to benchmark the solutions.

import heapq
from collections import Counter  

l = [
    ('herr', 1), ('dapao', 1),
    ('cino', 1), ('o', 38),
    ('tiao', 2), ('tut', 1),
    ('poh', 6), ('micheal', 1),
    ('orh', 1), ('horlick', 3),
    ('si', 1), ('tai', 1),
    ('titlo', 1), ('siew', 17),
    ('da', 1), ('halia', 2)
]
n = 5    

def author_1():
    return list(dict(Counter(dict(l)).most_common()[-n:]).keys())

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

_p2_heap = None    
def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])

    return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]

And here are the timeit results:

$ python -m timeit -s "import tst" "tst.author_1()"
100000 loops, best of 3: 7.72 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100000 loops, best of 3: 3.7 usec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
100000 loops, best of 3: 5.51 usec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
100000 loops, best of 3: 3.96 usec per loop

But if we make l = l * 1000 the difference will become noticeable:

$ python -m timeit -s "import tst" "tst.author_1()"
1000 loops, best of 3: 263 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100 loops, best of 3: 2.72 msec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
1000 loops, best of 3: 1.65 msec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
1000 loops, best of 3: 767 usec per loop

edited Mar 01 '18 at 15:26

answered Feb 22 '18 at 13:49

saaj

23,253
3
104
105

Your `prop_2` test is fatally flawed, because `_p2_heap` is global and **reused for the timed tests**. The `heapq.nsmallest()` function **already** creates that heap, and the only reason that `prop_2` is faster is because the global allows it to cache state for the trial runs. – Martijn Pieters Mar 01 '18 at 08:19
That said, this is absolutely a heapq problem, where using `heapq.nsmallest` gives you a O(K log N) solution (where K is the size of the input, and N is the desired number of items). Sorting gives you a O(K log K) solution, the heapq will easily beat it. – Martijn Pieters Mar 01 '18 at 08:22
@MartijnPieters If you paid enough attention to the answer your could notice *you can improve performance if you keep a "heapified" list. You can push and pop from the list*. This obviously was the intention to show that if it's used as internal data structure (what you call *cached*), it has performance benefit. `heapq` docs also say for `nsmallest` and for `nlargest` *if repeated usage of these functions is required, consider turning the iterable into an actual heap*. This was demonstrated. – saaj Mar 01 '18 at 09:23
There were more issues: *If you really don't have requirement of last n keys, it can a better idea to use heapq*. That's the opposite of what heapq gives, they *do* have that requirement, and it is *because* they have that requirement that they should use heapq. That's perhaps an editing or grammar error? – Martijn Pieters Mar 01 '18 at 09:40
*But note that the sorting it does is not stable*: this doesn't matter here, you are not doing a full sort. If there is a tie (more values have the same lowest value), the top N is not necessarily in the same order as the input, but for a given input, the same top N values are picked, consistently. If input order is important in this, it is trivial to add a counter to the key function. – Martijn Pieters Mar 01 '18 at 09:41
1

I did misunderstand what you meant with 'keep a heapified list'. Please do expand on that, to make it explicit that the performance improvement is gained only when you need to produce the top N more than once with the input sequence being updated. – Martijn Pieters Mar 01 '18 at 09:43
(put differently, if you only need to use the top N result repeatedly, and the input sequence *doesn't change*, then there is no point in using a heapified list, and call `nsmallest()` again and again, because you *already produced the desired top N list*). – Martijn Pieters Mar 01 '18 at 09:44
@MartijnPieters Now the feedback is thorough, thanks ;-) I'll address every comment. On grammar. The OP said he wants *the last n keys with the least counts* (1), but his solutions, which supposedly worked for him, provide some least keys, say, in arbitrary order (2). So I say if it's (2), he can use `heapq`. I don't see an issue here, but I can expand it. – saaj Mar 01 '18 at 10:02
@MartijnPieters On stability. I wouldn't stand on that sorting stability of `heapq` has something to do with the order of tuples with same key, but my point was the same as your *the top N is not necessarily in the same order as the input*. Having stateful key function isn't a common thing, or something I'd call trivial. Also I think it may have a performance implication for pre-hepified case. Anyway, should I replace stability with something else relevant here which explains why the order has changed? On 'keep a heapified list'. I'll expand. – saaj Mar 01 '18 at 10:07
@MartijnPieters Expanded. Does it look better now? – saaj Mar 01 '18 at 10:21
1

Actually, looking at the implementation I forgot that `nlargest` and `nsmallest` *already* add a counter to break ties, making those implementations *stable*. `nlargest()` counts down, `nsmallest()` counts up. Only when you use the `heapsort` function from the documentation will you have an unstable sort. – Martijn Pieters Mar 01 '18 at 13:54
This is actually documented, albeit implicit. The functions are documented to be equivalent to using `sorted()` with a slice. Since `sorted()` is stable, so must the `nsmallest` and `nlargest` implementations be to remain equivalent. – Martijn Pieters Mar 01 '18 at 13:56
@MartijnPieters That's true, I've missed that. Reworked `prop_1`. I've learned something new about Python, thanks. – saaj Mar 01 '18 at 15:28

score 2 · Answer 2 · edited Mar 01 '18 at 00:47

2

Just use a heap, it will give you the desired output.

import heapq

x = [('herr', 1),
('dapao', 1),
('cino', 1),
('o', 38),
('tiao', 2),
('tut', 1),
('poh', 6),
('micheal', 1),
('orh', 1),
('horlick', 3),
('si', 1),
('tai', 1),
('titlo', 1),
('siew', 17),
('da', 1),
('halia', 2)]

heap = [(item[1],-index,item[0]) for index, item in enumerate(x)]
heapq.heapify(heap)

print(list(map(lambda item : item[2], heapq.nsmallest(5, heap))))

heapq.nsmallest(n, iterable, key=None)has a key argument,you can use it inside of using -index like me.

edited Mar 01 '18 at 00:47

whackamadoodle3000

6,684
4
27
44

answered Feb 23 '18 at 09:14

obgnaw

3,007
11
25

This is absolutely a heapq problem. – Martijn Pieters Mar 01 '18 at 08:20

score 1 · Answer 3 · edited Mar 01 '18 at 00:44

1

Edit @alvas :

mi = min(x, key =lambda x:x[1])[1]
r = [a[0] for a in x if a[1] == mi][-5:]

Will generate the output you desire

You can use this:

sorted(x, key=lambda x: x[1])

Please refer to this (possible duplicate)

Sort a list of tuples by 2nd item (integer value)

edited Mar 01 '18 at 00:44

whackamadoodle3000

6,684
4
27
44

answered Feb 12 '18 at 11:21

Yuvraj Jaiswal

1,605
13
20

This will sort the values in order of the keys, he can loop and get the last n keys using `sorted(x, key=lambda x: x[1])[-n:]` – Yuvraj Jaiswal Feb 12 '18 at 11:36
@vaultah edited. This produces the output alvas desires. – Yuvraj Jaiswal Feb 12 '18 at 11:52
@vaultah can you provide the full list, I'll check. Also, this will give those with min counts, if alvas needs all in ascending order, he'll have to go by second approach of sorting. – Yuvraj Jaiswal Feb 12 '18 at 12:01
@YuvrajJaiswal No, both these solutions produce the wrong answer. – Joe Iddon Feb 12 '18 at 12:29

score 1 · Answer 4 · edited Mar 01 '18 at 00:47

Here is my suggestion:

n = 5
output=[]

# Search and store the n least numbers
leastNbs = [a[1] for a in sorted(x, key=lambda x: x[1])[:n]]

# Iterate over the list of tuples starting from the end
# in order to find the tuples including one of the n least numbers
for x,nb in reversed(x):
    if nb in leastNbs:
        output.append(x)  # Store the string in output
        print(x)

# Keep only the n last strings (starting from the end)
output = list(reversed(output[:n]))

print(output)

score 1 · Answer 5 · edited Mar 01 '18 at 00:45

You don't really need any imports for this task, you can also do it the following way:

x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

n = 5
result = [name[0] for name in sorted(x, key=lambda i: i[1], reverse=True)[-n:]]
print(result)

Output:

['orh', 'si', 'tai', 'titlo', 'da']

score 1 · Answer 6 · edited Mar 01 '18 at 00:44

You could use pandas if you don't feel like reinventing the wheel. Performance should be excellent since it's based on NumPy, which uses C under the hood instead of pure Python.

Short answer

df = pd.DataFrame(x, columns=['name', 'count'])
df = df.sort_values(by='count', kind='mergesort', ascending=False).tail(n)
print df['name'].tolist()

Result

['orh', 'si', 'tai', 'titlo', 'da']

Expanded, working example with comments

import pandas as pd

n = 5
x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

# Put the data in a dataframe.
df = pd.DataFrame(x, columns=['name', 'count'])

# Get the last n rows having the smallest 'count'.
# Mergesort is used instead of quicksort (default) since a stable sort is needed
# to get the *last* n smallest items instead of just *any* n smallest items.
df = df.sort_values(by='count', kind='mergesort', ascending=False).tail(n)

# Print the 'name' column as a list (since a list is what you asked for).
print df['name'].tolist()

score 1 · Answer 7 · edited Mar 01 '18 at 00:47

This is a clean, simple approach without using python idioms:

m = x[0][1]
l = []

for elem in x:
    if m > elem[1]:
        l = [elem[0]]
        m = elem[1]
    elif m == elem[1]:
        l.append(elem[0])

print(l[-5:])

It is kind of like a fusion of min-value search plus filtering. m stores the min value up till now, and l stores the list of elements which have that min count. You reset them when you find a lower value.

This can be modified to hold only 5 elements, so no splicing would be required in the end.

score 1 · Answer 8 · edited Mar 01 '18 at 00:43

1

[k for k,v in sorted(x, key=lambda x: x[1])[:n]]

Where x is list of key, count tuples and n is desired number of keys.

You may also adjust the sort criteria to include the keys themselves- if their order is important

[k for k,v in sorted(x, key=lambda x: (x[1], x[0]))[:n]]

edited Mar 01 '18 at 00:43

whackamadoodle3000

6,684
4
27
44

answered Feb 28 '18 at 19:36

stacksonstacks

8,613
6
28
44

score 1 · Answer 9 · edited Mar 01 '18 at 00:45

1

[i[0] for i in sorted(x.__reversed__(), key=lambda x: x[1])[:n]]

Almost exactly the same as @Stacksonstacks answer, just that this actually gives you the 'desired output' (if you put n = 5)

edited Mar 01 '18 at 00:45

whackamadoodle3000

6,684
4
27
44

answered Feb 28 '18 at 22:25

pooh17

64
4

score 0 · Answer 10 · edited Mar 01 '18 at 00:45

A Pure Python Solution

Since we are trying to find the n elements in order of min to max, we cannot simply filter out those which do not have the smallest second element. We also have the second aim of trying to maintain order - this eliminates merely sorting on the second element of each tuple.

My solution has complexity O(n) - which is the best you can do here as we are creating a new list which is dependent on a pre-existing list.

It works by creating a set (unordered) of the first n elements of each tuple in x - after x has been reversed ([::-1]) and then sorted based on the second element. This has the neat trick that since we are slicing before we convert to the set, there is still order within those tuples with equivalent second elements.

Now, the neatness of using a set is that lookup is O(1) (instant) as the elements are stored in order of their hashes, so calling __contains__ is much faster than with a list.

We finally need to use a list-comprehension to carry out the final filtering of x:

>>> n = 5
>>> s = {i[0] for i in sorted(x[::-1], key=lambda t: t[1])[:n]}
>>> [i for i, _ in x if i in s]
['orh', 'si', 'tai', 'titlo', 'da']

Also a test to shows that it works with n = 11

['herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da', 'halia']

@vaultah Oh! My bad, I didn't realise the OP meant least as *in order* or smallest not literally those with *the smallest*. — Joe Iddon, Feb 12 '18 at 11:57

score 0 · Answer 11 · edited Mar 01 '18 at 00:47

Using list comprehension and sorted:

[key for key,value in sorted(x, key=lambda y: y[1], reverse=True)][-n:]

or

[key for key,value in sorted(reversed(x), key=lambda y: y[1])][:n][::-1]

where n is the number of keys you want in the resultant. Note that using the latter, with [::-1], is more expensive because it slices the list once more to reverse it back.

from timeit import default_timer

def timeit(method, *args, **kwargs):
    start = default_timer()
    result = method(*args, **kwargs)
    end = default_timer()
    print('%s:\n(timing: %fs)\n%s\n' % (method.__name__, (end - start), result))

def with_copy(x, n):
    return [key for key,value in sorted(reversed(x), key=lambda y: y[1])][:n][::-1]

def without_copy(x, n):
    return [key for key,value in sorted(x, key=lambda y: y[1], reverse=True)][-n:]

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38), ('tiao', 2),
     ('tut', 1), ('poh', 6), ('micheal', 1), ('orh', 1), ('horlick', 3),
     ('si', 1), ('tai', 1), ('titlo', 1), ('siew', 17), ('da', 1),
     ('halia', 2)]
n = 5
timeit(with_copy, x, n)
timeit(without_copy, x, n)
n = 11
timeit(with_copy, x, n)
timeit(without_copy, x, n)

Results with `n = 5`:

with_copy:
(timing: 0.000026s)
['orh', 'si', 'tai', 'titlo', 'da']

without_copy:
(timing: 0.000018s)
['orh', 'si', 'tai', 'titlo', 'da']

Results with `n = 11`:

with_copy:
(timing: 0.000019s)
['halia', 'herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da']

without_copy:
(timing: 0.000013s)
['halia', 'herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da']

Jai · Answer 12 · 2018-03-01T18:00:00.197

No Need of sorting in this solution

Small Solution:

import numpy as np 
n = 5
x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

x = np.array(x)  # make the list a numpy array
names = x[:, 0]   
numbers = x[:, 1].astype(int)
least_count = np.take(names, np.where(numbers == np.min(numbers)))[0][-n:]
print(least_count)

output of above solution:
```
['orh', 'si', 'tai', 'titlo', 'da']
```

Explanation of solution with comments

import numpy as np 

x = [('herr', 1),
 ('dapao', 1),
 ('cino', 1),
 ('o', 38),
 ('tiao', 2),
 ('tut', 1),
 ('poh', 6),
 ('micheal', 1),
 ('orh', 1),
 ('horlick', 3),
 ('si', 1),
 ('tai', 1),
 ('titlo', 1),
 ('siew', 17),
 ('da', 1),
 ('halia', 2)]

x = np.array(x)  # make the list a numpy array
# ==========================================
# split the array into names and numbers
# ==========================================
names = x[:, 0]   
numbers = x[:, 1].astype(int)

mini = np.min(numbers)  # find the minimum in the numbers array
idx = np.where(numbers == mini)   # Find the indices where minimum occurs in the numbers array
least_count = np.take(names, idx)[0] # Use the indices found from numbers array in the above line to access names array
print(least_count)
least_count = least_count.tolist()  # to convert the numpy array to list
n = 5   # say n is 5
print(least_count[-n:]) # now you can do simple slicing to extract the last n element

output of above explaination:

['herr' 'dapao' 'cino' 'tut' 'micheal' 'orh' 'si' 'tai' 'titlo' 'da']
['orh', 'si', 'tai', 'titlo', 'da']

Getting the keys of items with the least counts from a list of tuples of key-value pairs - Python

NOTE

12 Answers12

Short answer

Result

Expanded, working example with comments

Results with n = 5:

Results with n = 11:

Results with `n = 5`:

Results with `n = 11`: