Python Sorting Dictionary by values can be unstable? (in Hackerrank)

Question

I'm doing some practice in Hackerrank for Python 3 learning.

In the task Most Common you are given a string which contains only lowercase English characters and you need to find the top three most common characters in that string.

I met some questions.

My solution for this problem is below:

#!/bin/python3
import sys

if __name__ == "__main__":
  s = input().strip()
  ch_dict = {}
  for ch in s:
      if ch in ch_dict : ch_dict[ch] +=1
      else: ch_dict[ch] = 1

  result = sorted(ch_dict.items(),key=lambda d:d[1],reverse=True)
  for i in result:
      if i[1] != 1:
          print(" ".join(map(str,i)))

When I test this code in local environment, it works!

But in online test, it may fail!

For this input:

aabbbccde

I submit a lot of times, sometimes get right answer like this:

b 3
a 2
c 2

and can also get this:

b 3
c 2
a 2

It seems like the sort can be unstable ? OR what's matter whit my code ? OR is something wrong in Hackerrank environment ?

How can i guarantee my output?

Dictionaries are *unordered*, and you are sorting on the values only. So when you get *equal values*, their order is the same as the input order. Which is implementation defined and can seem arbitrary. See [Why is the order in dictionaries and sets arbitrary?](//stackoverflow.com/a/15479974) — Martijn Pieters, Nov 03 '17 at 09:21
The second answer is as right as the first, both ar sorted by value. Stable and unstable are not applicable here because there is not a previous order to keep. — Stop harming Monica, Nov 03 '17 at 09:23
@Goyo: well, there is, but that order changes each new invocation of the interpreter because of the random hash seed. — Martijn Pieters, Nov 03 '17 at 09:28
@MartijnPieters I mean the code does not define an order. But you are right, in a particular execution there will be an order and in some cases stability will matter. — Stop harming Monica, Nov 03 '17 at 09:59
The Hackerrank task description also includes a specification that you might have overlooked: "*If the occurrence count is the same, sort the characters in ascending order.*" — user2314737, Dec 15 '17 at 08:13

Martijn Pieters · Accepted Answer · 2017-11-03T12:14:47.860

Python dictionaries are unordered. When you iterate over their contents, the order is implementation dependent, see Why is the order in dictionaries and sets arbitrary?

You are sorting your items by the values only, so given that your list the items in arbitrary order, sometimes the ('a', 2) pair will come first, sometimes the ('c', 2) pair is.

If you want to stabilise the order, break ties between values by sorting on the key as well.

Your challenge states:

Sort output in descending order of occurrence count.
If the occurrence count is the same, sort the characters in ascending order.

so you need to sort by value first, and then by key, and the direction between these two differs.

You can achieve this by sorting twice, or by sorting on the inverse score:

# Sort forward by key, to produce a stable order between keys
by_key = sorted(ch_dict.items(), key=lambda d: d[0])
# Sort backward by value, ties are left in the original order, so by key
result = sorted(by_key, key=lambda d: d[1], reverse=True)

or in one step:

sorted(ch_dict.items(), key=lambda d: (-d[1], d[0]))

so sort by the negative count, then the key, and don't reverse.

Note that the challenge actually asks only for the top three characters. The challenge doesn't use huge inputs, but if there were, then using sorting is actually inefficient. You don't need to sort all key-value pairs, only the top 3. How would you go about getting just the top 3? You could use a heap queue, which efficiently can give you the top N of any sequence:

import heapq

result = heapq.nsmallest(3, ch_dict.items(), key=lambda d: (-d[1], d[0]))

Where sorting takes O(NlogN) time (N being the size of the dictionary), a heapq takes O(NlogK) time, N being the same value but K being the count of top items; here it is 3. For a dictionary with 10.000 items, sorting takes about 133k steps to complete, but a heap queue only takes 16k steps. That's going to be almost 10 times faster!

score 3 · Answer 2 · answered Nov 03 '17 at 09:15

3

The problem is here:

key=lambda d:d[1]

The key only considers the second value, instead, use both values.

answered Nov 03 '17 at 09:15

Reut Sharabani

30,449
6
70
88

Daniel Roseman · Answer 3 · 2017-11-03T09:29:50.990

2

Dictionaries are unordered. You are sorting your output by the value only, but since the order of keys is not guaranteed in the original dict, the ordering within each value in the output can vary.

You can fix this by ordering on both:

sorted(ch_dict.items(), key=lambda d: (d[1], d[0]), reverse=True)

edited Nov 03 '17 at 09:29

answered Nov 03 '17 at 09:16

Daniel Roseman

588,541
66
880
895

This produces the wrong order for the challenge, where ties need to be sorted in ascending order, not descending. – Martijn Pieters Nov 03 '17 at 10:12

score 0 · Answer 4 · answered Nov 03 '17 at 09:18

dict.items can return the (key, value) pairs in any order, dependend on details like the implementation or key-insertion-order. sorted then iterates over these pairs in whatever order dict.items returned them.

If you want a deterministic output, use key=lambda d: (d[1], d[0]) in order to sort the (key, value) pairs by the key lexicographically if the value happens to be the same.

(In case you are using Python 2, key=lambda key, value: (value, key) looks nicer.)

Ilja Everilä · Answer 5 · 2017-11-03T09:32:29.187

sorted() is actually stable in that it preserves the order of items with the same key as extracted by the key function you provided – in this case the key being the value. But since dicts are unordered, the preserved order is undefined for items with the same value.

A solution is to sort by (value, key) tuples:

result = sorted(ch_dict.items(), key=lambda d: (-d[1], d[0]))

Note the removed reversed argument, replaced by negating the value, as it seems that you'd like to sort the keys in ascending order and the values in descending.

B. M. · Answer 6 · 2017-11-03T12:51:07.050

0

in the Hackerrank hierarchy, you are in the Collections section. so the solution is probably :

#!/bin/python3
import sys,collections
if __name__ == "__main__":
    s = 'abcdebbca' # input().strip()
    res=collections.Counter(s).items(s)
    sortres= sorted ( res, key=(lambda x : (-x[1],x[0])))
    for k,v in sortres[:3] : print k,v

the line sortres= sorted ( res, key=(lambda x : (-x[1],x[0]))) is necessary like well explained by @Martijn Pieters.

EDIT

Since problem arise from dict, an other answer which uses only lists, sets and sorted stability :

import sys

if __name__ == "__main__":
    s = raw_input().strip()
    set_k, list_kv = set() , list() 
    for x in sorted(s):
        if x not in set_k:
            if set_k : list_kv.append((-count,val))
            set_k.add(x)
            count , val = 0 , x
        count+=1
    for k,v in sorted(list_kv)[:3] : print v,-k

edited Nov 03 '17 at 12:51

answered Nov 03 '17 at 11:06

B. M.

18,243
2
35
54

`Counter.most_common()` doesn't include the key in the sort, it is prone to the very same issue. Depending on the current random hash seed, either `c` or `a` will be listed first (on Python < 3.6), and for different strings the insertion order comes into play too (all Python versions). – Martijn Pieters Nov 03 '17 at 11:11
That still won't work. What about `'aaabbbcccddde'`? Now you have *four* most common characters, but only a, b and c need to be listed. Your code will end up with a random subset. – Martijn Pieters Nov 03 '17 at 11:59
Thanks again. It passes the Hackerrank tests now. – B. M. Nov 03 '17 at 12:10
Yes, and now you have the same answer as several others here. :-) The Counter is no longer that large a differentiator (other than the fast counting code). – Martijn Pieters Nov 03 '17 at 12:13

Python Sorting Dictionary by values can be unstable? (in Hackerrank)

6 Answers6