
I have a list of items. My goal is to find the top k most frequent items without using collections.Counter.

Idea #1: I walk through the list and put each item in a dictionary along with a count; if I encounter the item again, I increment the count. Then I reverse the mapping, (item, count) -> (count, list of items), so each count points to the items that occur that many times, and walk through the counts in descending order until I have found k items.
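
For concreteness, a rough sketch of idea #1 (the names are just illustrative):

def top_k_idea1(items, k):
    # count occurrences of each item
    counts = {}
    for x in items:
        counts[x] = counts.get(x, 0) + 1

    # reverse the mapping: count -> list of items seen that many times
    by_count = {}
    for item, c in counts.items():
        by_count.setdefault(c, []).append(item)

    # walk the counts from highest to lowest until k items are collected
    result = []
    for c in sorted(by_count, reverse=True):
        for item in by_count[c]:
            result.append(item)
            if len(result) == k:
                return result
    return result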

Idea #2, perhaps a bit faster: I create a wrapper class "Count" that looks like this:

class Count:
    def __init__(self, data):
        self.data = data
        self.count = 1

    def inc(self):
        self.count += 1

    # hash and compare by the wrapped data, so a lookup with a new
    # Count(item) finds the Count already stored for that item
    def __hash__(self):
        return hash(self.data)

    def __eq__(self, other):
        return self.data == other.data

So I walk through the list and put Count(item) in a dictionary; if I encounter the item again, I increment the count in its Count object. I will then use quickselect on the dictionary's keys to get the top k items.
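
For completeness, a rough sketch of how I'd wire that up; instead of hand-rolling quickselect I'm using heapq.nlargest here, which also does a partial selection (O(n log k)) without sorting everything:

import heapq

def top_k_idea2(items, k):
    counts = {}  # Count objects keyed by themselves, via __hash__/__eq__
    for x in items:
        probe = Count(x)
        if probe in counts:
            counts[probe].inc()   # bump the Count already stored
        else:
            counts[probe] = probe

    # heapq.nlargest stands in for quickselect: pick the k largest by count
    best = heapq.nlargest(k, counts.keys(), key=lambda c: c.count)
    return [c.data for c in best]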

Is there a better/faster way?

lorenzocastillo
  • Why do you not want to use `collections.Counter`? – BrenBarn May 31 '16 at 00:19
  • @BrenBarn #1 in other languages there may not be such a class to handle this, so wanted to know how it could be done. #2 wanted to see what clever solution others came up with and see if there's a more efficient route than the two I'm thinking of. – lorenzocastillo May 31 '16 at 00:21
  • Your idea #1 is essentially just reimplementing `collections.Counter`. Your idea #2 is also essentially just reimplementing `collections.Counter` but storing instances of a wrapper class instead of the real data. You can look at the source of `collections.Counter` to see how it works. If you had to do this in a language that didn't have `collections.Counter`, you could use a hash table to implement the equivalent of a dict, and then use that to implement `collections.Counter`. But since the only tag on your question is `python`, it is strange to ask how you would do it in other languages. – BrenBarn May 31 '16 at 00:27
  • If you look at the source of `collections.Counter`, you'll see that it is a pretty simple wrapper around a `dict`. So any solution you create that involves using a `dict` to map objects to their counts is probably not going to be meaningfully different from `collections.Counter`. You should just use `collections.Counter`. – BrenBarn May 31 '16 at 00:29
  • @BrenBarn I didn't want a language-specific algorithm. I have added the 'algorithm' tag to reflect this. – lorenzocastillo May 31 '16 at 00:30
  • @lorenzocastillo, then remove the python tag, or add other languages! – Merlin May 31 '16 at 00:34
  • @lorenzocastillo: If you want a totally language-general answer, you'll need to provide more details about what language facilities you will or won't assume. The two answers below, for instance, both assume the existence of a `dict`, so they won't work if you're using a language without a built-in mapping/hashtable type. – BrenBarn May 31 '16 at 00:34

4 Answers


You can use a defaultdict or just a plain dict to do the job comfortably. This is essentially your first idea.

from collections import defaultdict
import random

data = [random.randint(1, 10) for _ in range(100)]

d = defaultdict(int)  # means default value is 0
# d = {}

for x in data:  # go through list
    d[x] += 1  # increment counts
    # d[x] = d.get(x, 0) + 1

# sort dict items by value (count) in descending order
sorted_items = sorted(d.items(), key=lambda i: -i[1])
# sorted_items = sorted(d.items(), key=lambda i: i[1], reverse=True)

# extract the keys
sorted_keys = [k for k, v in sorted_items]

# take n best
n = 8
n_most = sorted_keys[:n]

# e.g. [9, 5, 10, 3, 6, 2, 4, 1] (the data is random, so the exact order varies)

This sorts the dict's items (key-value pairs) by value (count) in descending order and slices off the first n keys.

user2390182
  • `sorted(d.items(), key=lambda i: i[1], reverse=True)` would be a little more explicit than `sorted(d.items(), key=lambda i: -i[1])`. – Tim Hoffmann May 31 '16 at 01:08
  • @TimHoffmann Yeah, I had that first, but went with the shorter version. Now, that I have split the last line into 3, I might as well go with the original :) – user2390182 May 31 '16 at 01:24

If, as you suggest, you are looking for interesting or optimal algorithms to solve this problem, rather than the way to do it in Python (which is surely to use collections.Counter), a good first step is to google the name of the problem plus "Knuth".

Searching "Knuth order values by frequency" led me to a "Programming Pearls" article in the June 1986 Communications of the ACM, with a literate program by Knuth to print the k most common words in a file in order of frequency.

This particular article was memorable because, after Knuth presented an 8-page elaborately documented solution using custom data structures and clever algorithms, Doug McIlroy critiqued it with a 6-command Unix pipeline to do the same thing, namely

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q

which transliterates non-alphabetic characters to newlines, squeezing out multiple newlines; changes uppercase to lowercase; sorts; replaces repeated words with a single representative and a count; sorts in reverse order numerically; and outputs the first ${1} lines.

Anyway, what algorithm did Knuth use? He invented a new, intricate data structure for the purpose, the hash trie. But the trie's properties are useful specifically for strings (it organizes them by prefix). For general data, as in your question, a hash map is more suitable, and it is not substantially different from the hash trie.

After scanning through the input and putting each item in the hash trie, or incrementing its count if already present, the trie has to be sorted by frequency. Knuth makes an array of 200 pointers, with the i-th one leading to a linked list of all the words with frequency i, and then an insertion-sorted list of all the words with frequency above 200. Some shenanigans are done to re-use the structure of the trie for these lists, so that additional storage isn't needed.

This is based on his assumption that "almost all of the frequency counts will be small." A more general method would be:

  • Keep track of the maximum count seen while you are adding items
  • Then use max_count instead of 200: make a list L of max_count + 1 lists, and go through your hash map and add each item to L[c], where c is its frequency (see the sketch below).
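
A minimal sketch of that general bucket-by-count scheme (the names here are mine, not Knuth's or Hanson's):

def top_k_by_buckets(counts, k):
    # counts: dict mapping item -> frequency
    max_count = max(counts.values(), default=0)
    buckets = [[] for _ in range(max_count + 1)]  # index c holds items seen c times
    for item, c in counts.items():
        buckets[c].append(item)

    # walk the buckets from the highest frequency down
    result = []
    for c in range(max_count, 0, -1):
        for item in buckets[c]:
            result.append(item)
            if len(result) == k:
                return result
    return result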

Indeed, in the July 1987 issue there was a follow-up with yet another literate program doing the same thing, by David Hanson, this one in C with loom (Knuth's was written in Pascal, with WEB). This one uses a hash map to track the counts, then adds each item with frequency c to a list in L[c]. It wastes a bit of space by making L as long as the total number of words seen, and by not determining max_count until populating L rather than while putting counts in the hash map.

Implementation details aside, the overall scheme is the same as your idea #1: increment counts in a dictionary, then sort by count. So it's a pretty good bet that this is the optimal strategy.

Nick Matteo
  • I think, neither Knuth's solution nor Hanson's is the best and/or easiest to understand. Both of them cannot even process a single Bible, not to mention bigger files. Out of those three 1980s solutions, only McIlroy's solution still works for large files. Better solutions both for didactic and production purposes are listed [here](https://stackoverflow.com/a/56958255/5407270). – Andriy Makukha Jul 10 '19 at 08:28

If you don't want to use collections.Counter, you can use a simple dict:

data = ['a', 'b', 'c', 'a', 'b']

counts = {}
for item in data:
    counts[item] = counts.get(item, 0) + 1
print(counts)

which outputs

{'a': 2, 'c': 1, 'b': 2}

or you can use a collections.defaultdict, which simplifies the above to just

import collections
counts = collections.defaultdict(int)
for item in data:
    counts[item] += 1
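
Either way, once the counts are in place, a rough sketch of pulling out the k most frequent (sorting the keys by their count):

k = 2
top_k = sorted(counts, key=counts.get, reverse=True)[:k]
print(top_k)  # ['a', 'b'] for the data above (order of ties may vary)
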
aldanor

If you want this kind of problem to run faster, use heapq: keep a min-heap of size k holding the k most frequent items. Check this url if you are interested.
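
A rough sketch of that idea (count first, then keep only the best k on a min-heap):

import heapq

def top_k_heap(items, k):
    counts = {}
    for x in items:
        counts[x] = counts.get(x, 0) + 1

    heap = []  # min-heap of (count, item), never more than k entries
    for item, c in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (c, item))
        else:
            # push the new pair and pop the smallest, keeping the k largest
            # (ties compare the items themselves, so items should be orderable)
            heapq.heappushpop(heap, (c, item))

    # most frequent first
    return [item for c, item in sorted(heap, reverse=True)]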

Cereal_Killer