5

Starting from a trivial implementation of the problem, I am looking for a significantly faster way to find the most common word in a Python list. As part of a Python interview I received feedback that this implementation is so inefficient that it is basically a failure. Later, I tried many algorithms I found, and only some heapq-based solutions are a bit faster, but not overwhelmingly so (measured with timeit: when scaled to tens of millions of items, the heap-based version is about 30% faster; at trivial lengths like a thousand items, it is almost the same).

def stupid(words):
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    return max(freqs, key=freqs.get)
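
For reference, a heapq-based variant of the same idea could look like the sketch below (one possible interpretation of the "heapq-based" solutions mentioned above, not necessarily the exact code that was timed):

import heapq

def heap_based(words):
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    # replace the max() scan with heapq.nlargest over the dict keys
    return heapq.nlargest(1, freqs, key=freqs.get)[0]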

As this is a simple problem and I have some experience (although I am nowhere near an algorithms guru or a competitive coder), I was surprised.

Of course, I would like to improve my skills and learn the supposedly much better way of solving the problem, so your input will be appreciated.

Clarification regarding the duplicate status: my point is to find out whether there actually is an (asymptotically) much better solution; the similar questions have accepted answers that are not much better. If this is not enough to make the question unique, close it, of course.

Update

Thank you all for the input. Regarding the interview situation, I am left with the impression that a hand-written search algorithm was expected (one that may be somewhat more efficient), and/or that the reviewer was assessing the code from the point of view of another language, with different constant factors. Of course, everyone is entitled to their own standards.

What was important for me was to check whether I am totally clueless (I had the impression that I am not) or merely tend to write code that is not the best possible. It is still possible that an even better algorithm exists, but if it has remained hidden from the community here for a few days, I am fine with that.

I am accepting the most upvoted answer - it seems fair to do so, even though more than one person provided useful feedback.

Minor update

It seems that using defaultdict has a noticeable advantage over using the dict.get method, even when get is aliased to a local name.
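
For illustration, the defaultdict variant I am referring to looks roughly like this (a sketch; the exact benchmarked code is not shown here):

from collections import defaultdict

def with_defaultdict(words):
    freqs = defaultdict(int)   # missing keys start at 0
    for w in words:
        freqs[w] += 1          # no per-word .get() call
    return max(freqs, key=freqs.get)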

Petar Donchev
  • http://stackoverflow.com/questions/1518522/python-most-common-element-in-a-list – yangjie Jul 08 '15 at 09:06
  • Your code doesn't look "so inefficient" to me... I'm not sure what the interviewer meant. It's linear in complexity and easy enough to read. Any code to count frequencies would need to traverse the whole list. – Alex Riley Jul 08 '15 at 09:09
  • The itertools solution is also much slower. – Petar Donchev Jul 08 '15 at 09:17
  • http://stackoverflow.com/questions/6987285/python-find-the-item-with-maximum-occurrences – Sudipta Jul 08 '15 at 09:19
  • Essentially the same. – Petar Donchev Jul 08 '15 at 09:23
  • Was the interviewer talking about memory efficiency or what exactly? I cannot see how an `O(n)` solution would be inefficient. – Padraic Cunningham Jul 08 '15 at 09:27
  • The interviewer was perhaps looking for a mapreduce solution. – user1952500 Jul 08 '15 at 09:35
  • One minor optimization is that you can skip the last loop for finding max and do it as part of the upper loop. – user1952500 Jul 08 '15 at 09:39
  • Adding the condition and reassignments inside the loop seems to negate the benefits. – Petar Donchev Jul 08 '15 at 09:57
  • It would be trivial to solve this problem using [`collections.Counter`](https://docs.python.org/3/library/collections.html#collections.Counter). It could be many times faster than your version simply because the class is in a built-in library and might have some or all portions of it written in C — although it's essentially just a different implementation of the same algorithm you used. – martineau Jul 08 '15 at 10:12
  • @martineau: in practice `collections.Counter` is slower. – MattH Jul 08 '15 at 10:32
  • "30% faster" and "asymptotically better solution" are different aims. – Colonel Panic Jul 08 '15 at 11:18
  • @ColonelPanic - did you read the question? – SiHa Jul 08 '15 at 11:23
  • Your algorithm seems about as good as could possibly be produced during an interview situation. If it's faster than using Counter but still not good enough, then what does that mean? That the Python Counter module is "basically a failure"? Perhaps the interviewer misunderstood your code, or he or she was just saying it was bad as a test to see how confident you are in your abilities and how willing you are to defend your work when challenged? – samgak Jul 08 '15 at 14:43
  • It was not a live interview with a person, just a timed online coding session and the feedback was delivered by another person later. – Petar Donchev Jul 08 '15 at 14:46

6 Answers

2

That sounds like a bad interview question, probably a case of the interviewer expecting a certain answer. It definitely sounds like s/he didn't clearly explain what s/he was asking.

Your solution is O(n) (where n = len(words)), and using a heap doesn't change that.

There are faster approximate solutions...
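
For example, a simple sampling-based approximation could look like the sketch below (a hypothetical illustration, not necessarily what the interviewer had in mind; sample_size is an arbitrary parameter):

import random

def approx_most_common(words, sample_size=10000):
    # count only a random sample of the list; fast for huge inputs, but only approximate
    sample = random.sample(words, min(sample_size, len(words)))
    freqs = {}
    for w in sample:
        freqs[w] = freqs.get(w, 0) + 1
    return max(freqs, key=freqs.get)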

taleinat
  • A hashmap / dict lookup doesn't come for free. In most implementations it's O(log(n)), which makes this O(n*log(n)). Unless you have infinite memory you essentially have the log(n) factor included. – user1952500 Jul 08 '15 at 09:31
  • Approximate solution is certainly not what was asked for. – Petar Donchev Jul 08 '15 at 10:02
  • A dictionary has O(log n) complexity, so the complexity of the solution is O(n log n). – Darwly Jul 08 '15 at 10:11
  • It's actually `O(n * log m)`, where `m` is the number of distinct words. But assuming the number of distinct words in a language is finite (and actually not very large), I would consider this `O(n)`. – taleinat Jul 08 '15 at 10:35
  • dict in Python is implemented using hash tables, so there is no log(n) factor. Of course the worst case is quadratic due to hash collisions, but [with high probability](http://cs.stackexchange.com/questions/34048/what-is-with-high-probability-on-probabilistic-algorithms) the total runtime is in O(n). In a deterministic setting, there is an O(n log n) lower bound per [element distinctness](https://en.wikipedia.org/wiki/Element_distinctness_problem). – Niklas B. Jul 08 '15 at 12:51
  • @NiklasB. In general, in use cases such as this, hash tables need to be resized every once in a while when adding items. Overall they usually require `O(m * log(m))` operations to build, where `m` is the total number of items. This can be avoided if the (approximate) number of items is known in advance, though. – taleinat Jul 08 '15 at 12:57
  • @taleinat That is just not true. The cost of resizing can be charged to the insertions that happened before, so the expected building time is O(m) – Niklas B. Jul 08 '15 at 12:58
  • @taleinat For a proof, see for example http://courses.csail.mit.edu/6.006/spring11/rec/rec07.pdf – Niklas B. Jul 08 '15 at 13:00
  • @NiklasB. That's not a complete proof, since usually more than a single resize happens. But you are correct that overall this can be `O(m)`, my mistake. – taleinat Jul 08 '15 at 13:10
1
from collections import Counter

word_counter = Counter(words)

word_counter is a dictionary with words as keys and frequencies as values, and also has a most_common() method.
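
For instance (a small usage sketch with a made-up example list):

from collections import Counter

words = ['a', 'b', 'a', 'c', 'a']                  # example input
word_counter = Counter(words)
top_word, top_count = word_counter.most_common(1)[0]
# top_word == 'a', top_count == 3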

DeepSpace
1

Function calls and global-namespace lookups are more expensive than local operations.

Your stupid function makes two function calls per element of the word list. The second, in your max call, is entirely avoidable: iterating over the keys of a dict and then looking each value up with dict.get is a flagrant inefficiency when you can iterate over key-value pairs instead.

def stupid(words):
  freqs = {}
  for w in words:
    freqs[w] = freqs.get(w, 0) + 1
  return max(freqs, key=freqs.get)

def most_frequent(words):
  ## Build the frequency dict
  freqs = {}
  for w in words:
    if w in freqs:
      freqs[w] += 1
    else:
      freqs[w] = 1
  ## Search the frequency dict
  m_k = None
  m_v = 0
  for k, v in freqs.iteritems():
    if v > m_v:
      m_k, m_v = k, v
  return m_k, m_v

Using user1952500's single pass suggestion, how does this fare on your large sample sets?

def faster(words):
  freq = {}
  m_k = None
  m_v = 0
  for w in words:
    if w in freq:
      v = freq[w] + 1
    else:
      v = 1
    freq[w] = v
    if v > m_v:
      m_k = w
      m_v = v
  return m_k, m_v

This has the slight advantage of being stable when there are multiple equally frequent values: the first word to reach the maximum count wins, regardless of dict ordering.


Comparison of all suggestions, using nltk.book to produce a sample:

def word_frequency_version1(words):
  """Petar's initial"""
  freqs = {}
  for w in words:
    freqs[w] = freqs.get(w, 0) + 1
  return max(freqs, key=freqs.get)

def word_frequency_version2(words):
  """Matt's initial"""
  ## Build the frequency dict
  freqs = {}
  for w in words:
    if w in freqs:
      freqs[w] += 1
    else:
      freqs[w] = 1
  ## Search the frequency dict
  m_k = None
  m_v = 0
  for k, v in freqs.iteritems():
    if v > m_v:
      m_k, m_v = k, v
  return m_k, m_v

def word_frequency_version3(words):
  """Noting max as we go"""
  freq = {}
  m_k = None
  m_v = 0
  for w in words:
    if w in freq:
      v = freq[w] + 1
    else:
      v = 1
    freq[w] = v
    if v > m_v:
      m_k = w
      m_v = v
  return m_k, m_v

from collections import Counter
def word_frequency_version4(words):
  """Built-in Counter"""
  c = Counter(words)
  return c.most_common()[0]


from multiprocessing import Pool
def chunked(seq, count):
  v = len(seq) // count
  for i in range(count):
    # give the last chunk any remainder so no words are dropped
    yield seq[i*v:] if i == count - 1 else seq[i*v:i*v+v]

def frequency_map(words):
  freq = {}
  for w in words:
    if w in freq:
      freq[w] += 1
    else:
      freq[w] = 1
  return freq

def frequency_reduce(results):
  freq = {}
  for result in results:
    for k, v in result.iteritems():
      if k in freq:
        freq[k] += v
      else:
        freq[k] = v
  m_k = None
  m_v = 0   # 0 rather than None so the comparison below is well-defined
  for k, v in freq.iteritems():
      if v > m_v:
        m_k = k
        m_v = v
  return m_k, m_v

# def word_frequency_version5(words,chunks=5,pool_size=5):
#   pool = Pool(processes=pool_size)
#   result = frequency_reduce(pool.map(frequency_map,chunked(words,chunks)))
#   pool.close()
#   return result

def word_frequency_version5(words,chunks=5,pool=Pool(processes=5)):
  """multiprocessing Matt's initial suggestion"""
  return frequency_reduce(pool.map(frequency_map,chunked(words,chunks)))

def word_frequency_version6(words):
  """Petar's one-liner"""
  return max(set(words),key=words.count)


import timeit
freq1 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version1 as func; print func.__doc__')
freq2 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version2 as func; print func.__doc__')
freq3 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version3 as func; print func.__doc__')
freq4 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version4 as func; print func.__doc__')
freq5 = timeit.Timer('func(words,chunks=chunks)','from __main__ import words, word_frequency_version5 as func; print func.__doc__; chunks=10')
freq6 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version6 as func; print func.__doc__')

Results:

>>> print "n={n}, m={m}".format(n=len(words),m=len(set(words)))
n=692766, m=34464
>>> freq1.timeit(10)
"Petar's initial"
3.914874792098999
>>> freq2.timeit(10)
"Matt's initial"
3.8329160213470459
>>> freq3.timeit(10)
"Noting max as we go"
4.1247420310974121
>>> freq4.timeit(10)
"Built-in Counter"
6.1084718704223633
>>> freq5.timeit(10)
"multiprocessing Matt's initial suggestion"
9.7867341041564941

Notes:

  • I'm cheating by passing a multiprocessing.Pool instance as a kwarg for timing purposes, since I wanted to avoid the pool start-up cost and timeit doesn't let you specify clean-up code. This was run on a "quad" CPU VM; I'm sure that for some input sizes and CPU counts multiprocessing will be faster.
  • Most versions return the highest-frequency word; which one you get may be arbitrary if there's a tie for first place.
  • Approximations of the highest frequency (e.g. using sampling) may be faster, but will only be approximate.
  • Version 6 (one-liner) should be ignored for large values of n*m.
MattH
1

You have to go through all the words at least once, which yields Omega(n). Storing the running count for each distinct word costs Omega(log n) per access.

If you find a storage (get/set) which is O(1) for the distinct words, you can create an O(n) solution. To my knowledge we only have O(log n) solutions for such storage (independent of type: heap, map, tree, dict, set...).

EDIT (check comments): [Your solution is O(n log n) because of the dictionary check] + O(n) because of the max(), making it O(n log n) in total... which is fine.

To my knowledge this is (complexity-wise) a good solution. You might get better performance by using different types of storage, like search trees or heaps, but the complexity should stay the same.

EDIT: From the comment discussion: with a hash table you can get O(n) on average (amortized).

Darwly
  • It's actually O(n) w.h.p because dict is implemented using hash tables. – Niklas B. Jul 08 '15 at 11:56
  • Hashtables have O(1) only amortized; in the worst case it is O(n), making the whole solution O(n^2)... check http://stackoverflow.com/questions/9214353/hash-table-runtime-complexity-insert-search-and-delete ... that is why I took the O(log n), making it the better-complexity solution, as it gives O(n log n). – Darwly Jul 08 '15 at 12:38
  • No, amortized complexity is something entirely different, you probably mean average. However, I was referring to your statement "Your solution is O(n log n) because of the dictionary check", which is not correct, because dictionaries are not implemented using a data structure with logarithmic ops. By the way, "w.h.p" means with high probability, i.e. the probability that the runtime is O(n) converges to 1 for n towards infinity. – Niklas B. Jul 08 '15 at 12:49
  • w.h.p - you learn something new every day :).. Are you sure about converging to 1 as n goes to infinity? As far as I understand, the hash table is highly dependent on the hash-table size, meaning that for large word sets you need a large enough hash table for the complexity to be constant... I will add a pointer to the comments in the post. – Darwly Jul 08 '15 at 12:56
  • You typically choose the table size to be within a constant factor of the number of elements, for example twice as large. For dynamic hash tables, you just double the size after you reach a certain threshold. – Niklas B. Jul 08 '15 at 13:01
  • I meant: average and amortized (copied from the link I put earlier). I always thought that hash tables are O(n) in the worst case... thanks for the info. – Darwly Jul 08 '15 at 13:13
0

Your dictionary/counter solution looks good to me. Its advantage is you can parallelise the counting step.

The other obvious algorithm is:

  1. Sort the list
  2. Loop over the list counting repeated values, recording the longest run so far

This has time complexity O(n log n) where n is the length of the list.
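
A minimal sketch of that approach (not the code benchmarked in the comments):

def most_common_sorted(words):
    if not words:
        return None
    ordered = sorted(words)              # O(n log n)
    best_word, best_run = ordered[0], 1
    run_word, run_len = ordered[0], 1
    # scan for the longest run of equal adjacent items
    for w in ordered[1:]:
        if w == run_word:
            run_len += 1
        else:
            run_word, run_len = w, 1
        if run_len > best_run:
            best_word, best_run = run_word, run_len
    return best_word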

Colonel Panic
  • Using the sort method is marginally better, though slower than MattH's suggestion. Probably a different sort implementation can change that. – Petar Donchev Jul 08 '15 at 11:56
-1

Obviously you need to look at each word in words, so it can only be the search at the end that's the problem. Would it be an option to keep an extra reference to the most common word? Something like:

def stupid(words):
    freqs = {}
    most = None
    for w in words:
        word_freq = freqs.get(w, 0) + 1
        if most is None or word_freq > most[0]:
            most = (word_freq, w)
        freqs[w] = word_freq
    return most if most is None else most[1]

This would of course use extra space but avoid searching.

reikje