6

I would like to get the first letter with the maximum occurence of a string.

For instance:

 "google" -> g  
 "azerty" -> a  
 "bbbaaa" -> b

I already have a working code, using OrdererDict() to avoid automatic keys rearangement:

from collections import OrderedDict

sentence = "google"

d = OrderedDict()

for letter in sentence:
    if letter not in d.keys():
        d[letter] = sentence.count(letter)

print(max(d, key=d.get)) # g

but I'm looking for a possible one-liner or more elegant solution (if it's possible).

Note: I already tried to use Counter() but it doesn't work, since dict in python don't remember the order that keys were inserted.

e.g

from collections import Counter

sentence = "bbbaaa"

c = Counter(sentence)
print(c.most_common()[0][0]) # have 50% chances of printing 'a' rather than 'b'.

Bonus question: Can someone explains why OrderedDict() are not the default dictionary behavior in python?

Kruupös
  • 5,097
  • 3
  • 27
  • 43
  • OrderedDict() is slower than dict. – joel goldstick Aug 28 '16 at 16:13
  • Why do you need an OrderedDict at all? if you want to make sure keys aren't over-written use [the setdefault method or a DefaultDict](http://stackoverflow.com/questions/3483520/use-cases-for-the-setdefault-dict-method) and for a "one-liner" you could just reduce the loop into a comprehension – LinkBerest Aug 28 '16 at 16:16
  • See [OrderedDict vs defaultdict vs dict](http://stackoverflow.com/a/19643045/298607) – dawg Aug 28 '16 at 16:16
  • What would `goooogle` and `aaabbbb` desired result be? – dawg Aug 28 '16 at 16:18
  • @dawg `goooogle` -> o and `aaabbbb` -> b – Kruupös Aug 28 '16 at 16:21
  • @JGreenwell, I need OrdererDict to remember when keys were added. For instance, I need to get the first key when the letters have the same amounts of occurence in the string. – Kruupös Aug 28 '16 at 16:28
  • 1
    @MaxChrétien Is the tie broken by the one that occurs first in the word or the first lexicographically. – gowrath Aug 28 '16 at 16:39
  • @gowrath the one that occurs first in the word. – Kruupös Aug 28 '16 at 16:55

5 Answers5

6

The documentation for collections.OrderedDict actually has a recipe for an OrderedCounter:

In [5]: from collections import Counter, OrderedDict

In [6]: class OrderedCounter(Counter, OrderedDict):
   ...:     pass
   ...:

In [7]: OrderedCounter("google").most_common()[0][0]
Out[7]: 'g'
Blender
  • 289,723
  • 53
  • 439
  • 496
  • Damn I fell stupid for not reading the doc entierly. Thanks anyway it's exactly what I was looking for – Kruupös Aug 28 '16 at 16:24
5

Probably not very fast, but one-liner!

>>> s = "aaabbbbc"
>>> sorted(s, key=lambda c: (-s.count(c), s.index(c)))[0]
'b'

Edit

Even shorter, thanks to @Ohad Eytan's comment:

>>> min(s, key=lambda c: (-s.count(c), s.index(c)))
'b'

Benchmark

Feeling bored tonight, so I benchmarked (using timeit) test @Joohwan's most_common_char() solution (mostcc), @Blender's OrderedCounter solution (odict) and my own one liner solution (onelin, using the min variant). The fastest solution was consistently mostcc: up to ~10x faster than onelin for long strings containing few different characters, and up to ~4x faster than odict for very short strings. For short strings or strings with little repeated chars, onelin beats odict (otherwise, it's the reverse). Here are the details (Length=length of the string, #chars=number of different unicode chars to randomly pick from for each char, mostcc=time to execute 10,000 times mostcc, odict=how much longer odict was compared to mostcc, onelin=how much longer oneline was compared to mostcc).

Length  #chars  mostcc odict  onelin
10      10:     0.08s  3.76x  1.61x
10      100:    0.10s  3.57x  1.27x
10      1000:   0.12s  3.12x  1.34x
100     10:     0.43s  1.96x  3.29x
100     100:    0.59s  2.16x  2.18x
100     1000:   0.80s  1.92x  1.72x
1000    10:     3.48s  1.56x  9.79x
1000    100:    3.44s  1.72x  6.43x
1000    1000:   6.55s  1.68x  3.30x
MiniQuark
  • 46,633
  • 36
  • 147
  • 183
  • 1
    Why not `min(s, key=lambda c: (-s.count(c), s.index(c)))`? – Ohad Eytan Aug 28 '16 at 18:01
  • 1
    Nice! I didn't even know `min()` had a `key` argument. Updating my answer. – MiniQuark Aug 28 '16 at 20:31
  • First of all, thanks for this benchmark! Since this topic is now a speed test, could you pass the code you use to output those results, it could be nice for new chalengers to have it. Otherwise, well played to @Joohwan with the backwards iterations! My favorite answer remains OdererCounter since it's the easiest one to understand but still have good performances. – Kruupös Aug 29 '16 at 08:36
  • Glad you like the benchmark. Unfortunately I just did this in a console, I did not save the code. Basically I created a function for each, then I ran `timeit.timeit()` on random strings of various lengths and #chars, for each function. Something like: `t = timeit.timeit("mostcc(s)", number=10000, globals={"mostcc":mostcc, "s": s})`. To generate random strings I did something like this: `"".join([random.randint(32, n_chars+32) for i in range(length)]`. – MiniQuark Aug 29 '16 at 09:18
3

I am aware that you want a one-liner, but what if you had to repeat this task many times or handle really long sentences? I don't know the exact use case here but it could be worth your time considering the space and time complexity of the algorithm.

In your solution, for example, you are iterating through the sentence many times more than necessary with sentence.count(), which takes O(n * number of unique characters). After that you iterate through the ordereddict once more to find the max (another O(number of unique characters) operation).

In the accepted solution, we end up having to define a new class (which breaks your 1 liner requirement btw) and instantiate new objects with alot of boilerplate code and functionalities you probably won't need every time you want to execute your task.

If you don't mind having a few more lines of code (again, I know this is not what the question is asking), we can build a reusable function which only has to iterate through the string once and use constant and minimal space:

from collections import defaultdict


def most_common_char(sentence):
    if not sentence:
        return ''

    max_count = 1
    max_char = sentence[-1]
    char_counts = defaultdict(int)
    char_counts[max_char] = 1

    for i in xrange(len(sentence) - 2, -1, -1):
        char = sentence[i]
        char_counts[char] += 1
        if char_counts[char] >= max_count:
            max_count = char_counts[char]
            max_char = char

    return max_char

We keep track of the character with the max count as we go through the string and spit it out at the end of the iteration. Note that we iterate backwards since you want the letter that comes first (i.e. last updated wins).

Joohwan
  • 2,374
  • 1
  • 19
  • 30
  • Most of the time, I prefer using official library when possible. It makes the code clearer and less error-free. But I've got your point, my solution isn't the best in term of complexity. I'm not sure for the validate answer though. I will make a benchmark of all the answers tomorrow to compare which one is the best. Anyhow, thanks for your time. – Kruupös Aug 28 '16 at 17:34
  • 1
    @Joohwan I benchmarked a few solutions, yours won by a large margin (see my answer for details). :) – MiniQuark Aug 28 '16 at 21:15
  • @Joohwan, after an attempt to minify your function (which I failed ^^') I still removed 4 lines. Explanation: using `for char in reverse(sentence)` instead of `for i in xrange(len(sentence) - 2, -1, -1):`. This allows you to remove `char = sentence[i]` because you already got the `char`. Because now you start with the last character of the string you can remove the first `char_counts[max_char] = 1` and replace `max_char = sentence[-1]`by `max_char = ''` , so that you can deletethe `if not sentence: return ''`. – Kruupös Aug 29 '16 at 19:22
2

You can use Counter() together with next() to find the first letter that meets the condition:

>>> s = "google"
>>> c = Counter(s)
>>> next(x for x in s if c[x] == c.most_common(1)[0][1])
'g'
Ohad Eytan
  • 8,114
  • 1
  • 22
  • 31
  • Since you have `c.most_common(1)` that will return one of the most common elements but not the rest with the same count. So the entire predicate of `next(x for x in s if c[x] == c.most_common(1)[0][1])` suffers from the same random return... – dawg Aug 28 '16 at 16:59
  • 1
    @dawg no it's not. The `c.most_common(1)[0][1]` will give just the desired **value** and that isn't random at all – Ohad Eytan Aug 28 '16 at 17:02
1

You can also fix the issue you describe at the end of your question about using Counter by having the resulting list sorted by various attributes: firstly count, secondly lexicographical order like this:

from collections import Counter

sentence = "google"

c = Counter(sentence)
print(sorted(c.most_common(), key = lambda x: (-x[1], sentence.index(x[0]))))

Output:

=> [('g', 2), ('o', 2), ('l', 1), ('e', 1)]

Just for fun:

Golfed Version:

# If your sentence is s:
print(sorted(collections.Counter(s).most_common(),key=lambda x:(-x[1],s.index(x[0]))))
gowrath
  • 3,136
  • 2
  • 17
  • 32