12

This question asks how to determine if every element in a list is the same. How would I go about determining if 95% of the elements in a list are the same in a reasonably efficient way? For example:

>>> ninety_five_same([1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1])
True
>>> ninety_five_same([1,1,1,1,1,1,2,1]) # only 80% the same
False

This would need to be somewhat efficient because the lists may be very large.

davidscolgan
  • @Tim: Finding out which element is the expected one is actually a little tricky. – Thilo Oct 18 '10 at 09:30
  • Well, the expected element will necessarily be the distribution's mode. No other value could reach 95%. – Tim McNamara Oct 18 '10 at 09:36
  • Not sure calculating the complete distribution will satisfy the efficiency requirement. – Thilo Oct 18 '10 at 09:39
  • In the second example, how are you getting the 80% number? I don't understand what you are trying to calculate. Based on my understanding, the second example should be 87.5% the same. (7 out of 8) – recursive Oct 18 '10 at 15:51

8 Answers

16
>>> from collections import Counter
>>> lst = [1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
>>> _, freq = Counter(lst).most_common(1)[0]
>>> len(lst)*.95 <= freq
True
SilentGhost
  • Python really does have some neat tricks hidden in its back pocket. – Tim McNamara Oct 18 '10 at 09:44
  • Should note that this requires Python version 2.7, which is when the `Counter` class was added to the `collections` module. – martineau Oct 18 '10 at 14:24
  • @martineau: it got added to py3.1 and then backported to 2.7, that is to say it's been around for a while. Also, Python 2.7 is a current stable version of Python. – SilentGhost Oct 18 '10 at 14:29
  • @SilentGhost: I know that and only brought the dependency up because, for various reasons, a number of people using this website aren't or can't use the latest release of Python 2 -- and sometimes don't understand that is why a particular answer won't work for them. – martineau Oct 18 '10 at 18:14
  • @martineau: You could use http://code.activestate.com/recipes/576611/ in Python ≥2.5. – kennytm Oct 18 '10 at 20:28
  • It should be noted that this solution requires all elements to be hashable. – kennytm Oct 18 '10 at 20:29
  • @KennyTM: Comment 1: Nice! I wasn't aware of that one -- now others are, too. Comment 2: Good point, a fact often forgotten in the answers posted here. Fortunately, in practical use, they often are. – martineau Oct 18 '10 at 21:14
15

Actually, there's an easy linear solution for a similar problem, only with a 50% constraint instead of 95%. Check this question; it's just a few lines of code.

It will work for you as well; in the end you just check that the selected element satisfies the 95% threshold instead of 50%. (Although, as Thilo notes, that check isn't necessary if currentCount >= n*0.95 already.)

I'll also post the Python code from st0le's answer, to show everybody how difficult it is.

# Boyer-Moore majority vote: the surviving currentValue is the only
# value that could possibly hold a majority of the list.
currentCount = 0
currentValue = lst[0]
for val in lst:
    if val == currentValue:
        currentCount += 1
    else:
        currentCount -= 1

    if currentCount == 0:
        currentValue = val
        currentCount = 1

If you're looking for explanation, I think Nabb has got the best one.
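Putting the pieces together, a complete function might look like this (a sketch in modern Python; the final counting pass implements the 95% verification mentioned above, and the function name is taken from the question):

```python
def ninety_five_same(lst):
    # Boyer-Moore majority vote: after one pass, candidate is the only
    # value that could possibly occupy 95% (or even 50%) of the list.
    count = 0
    candidate = None
    for val in lst:
        if count == 0:
            candidate = val
            count = 1
        elif val == candidate:
            count += 1
        else:
            count -= 1
    # Second pass: verify the candidate actually meets the 95% threshold.
    return lst.count(candidate) >= 0.95 * len(lst)
```

The vote loop alone only finds a candidate; without the second pass it can accept a value that merely survived the eliminations.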

Nikita Rybak
  • +1. O(N). Looking at all the other answers should settle the argument whether this problem was "trivial". – Thilo Oct 18 '10 at 09:43
  • Excellent animated demo here: http://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html – Thilo Oct 18 '10 at 09:48
  • After this, you need to do another pass to ensure that the majority indeed is 95% (except for the cases where this can already be deduced from the final value of currentCount). – Thilo Oct 18 '10 at 09:50
  • @larsmans Yep, no solution is perfect. (And I've already mentioned in the post that it needs to be checked whether the given frequency rate is satisfied.) Your algorithm is nice too, even though it uses a considerable amount of additional memory and will probably be slower. – Nikita Rybak Oct 18 '10 at 10:11
  • @Nikita Rybak: your last claim depends on the nature of the input, esp. the number of distinct items and the memory locality of the original list. I gave you a +1 and updated my answer. – Fred Foo Oct 18 '10 at 10:17
  • @larsmans I agree that your solution is a good one (the best you can get without fancy theorems and textbook algorithms). Still, I do think that a second pass over the input list will be faster than invoking hash n times. (That's assuming the input list is available at all.) – Nikita Rybak Oct 18 '10 at 10:32
  • In most cases, you would not need that second pass. currentCount should be significant enough. – Thilo Oct 18 '10 at 10:32
6
def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    freqsort = sorted(freq.itervalues())
    return freqsort[-1] >= .95 * sum(freqsort)

Assuming perfect hash table performance and a good sorting algorithm, this runs in O(n + m lg m), where m is the number of distinct items. O(n lg n) worst case.

Edit: here's an O(n + m), single-pass version (assuming m << n):

def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    freq = freq.values()
    return max(freq) >= .95 * sum(freq)

Memory use is O(m). max and sum can be replaced by a single loop.
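For illustration, that single loop might look like this (a sketch with the same behavior as the max/sum version above):

```python
from collections import defaultdict

def ninety_five_same(lst):
    # Count occurrences of each distinct value: O(n).
    freq = defaultdict(int)
    for x in lst:
        freq[x] += 1
    # One loop over the m distinct counts replaces max() and sum().
    best = total = 0
    for count in freq.values():
        if count > best:
            best = count
        total += count
    return best >= 0.95 * total
```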

hughdbrown
Fred Foo
3

This is even less efficient than checking whether every element is the same.

The algorithm is roughly the same: go through every element in the list and count those that do not match the expected one (with the extra difficulty of knowing which one is the expected one). However, this time you cannot just return false when you meet the first mismatch; you have to continue until you have enough mismatches to make up a 5% error rate.

Come to think of it, figuring out which element is the "right" one is probably not so easy, and involves counting every value up to the point where you can be sure that 5% are mismatched.

Consider a list with 10,000 elements of which 99% are 42:

  (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 100, 42, 42, 42, 42, ..., 42)

So I think you would have to start out building a frequency table for at least the first 5% of the list.

Thilo
  • I like this idea. It is easy to understand and should be quite fast. The tricky part will be finding out the stop condition, but I think that is fairly easy. – Hannes Ovrén Oct 18 '10 at 09:46
  • Forget about my answer, use Boyer-Moore's majority vote algorithm outlined by Nikita. – Thilo Oct 18 '10 at 09:51
1
def ninety_five_same(l):
    return max(l.count(i) for i in set(l)) * 20 >= 19 * len(l)

This also eliminates the problem with the accuracy of float division, since 95% = 19/20 and the comparison stays in integers.

Rok
  • Otherwise good, but you do a complete count of the whole list for every distinct value, plus the set construction. Very heavy stuff with a big list that has many different values, each making up only a small portion of the total length. – Tony Veijalainen Oct 18 '10 at 20:22
0

Think about your list as a bucket of red and black balls.

If you have one red ball in a bucket of ten balls, and you pick a ball at random and put it back in the bucket, and then repeat that sample-and-replace step a thousand times, how many times out of a thousand do you expect to observe a red ball, on average?

Check out the Binomial distribution and check out confidence intervals. If you have a very long list and want to do things relatively efficiently, sampling is the way to go.
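As a sketch of the sampling idea (the function name, sample size, and fixed seed here are illustrative choices, not from the answer): draw k elements with replacement and test the observed mode's share. Per the binomial distribution, the sample proportion has standard error around sqrt(p(1-p)/k), so around 1000 samples pins the true proportion down to a percent or two.

```python
import random

def probably_ninety_five_same(lst, k=1000, seed=0):
    # Estimate the most common element's share from k samples drawn
    # with replacement; a fixed seed keeps the sketch repeatable.
    rng = random.Random(seed)
    counts = {}
    for _ in range(k):
        x = rng.choice(lst)
        counts[x] = counts.get(x, 0) + 1
    return max(counts.values()) >= 0.95 * k
```

This can report the wrong answer when the true proportion sits near the 95% boundary; a careful version would use a confidence interval rather than a point estimate.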

Alex Reynolds
  • Problem is that you have not only red and black balls (but potentially hundreds of different colors). And sampling seems very unreliable, considering that there is an O(N) exact solution. – Thilo Oct 18 '10 at 09:56
  • If you know the number of colors, you can always extend to a multinomial. If your list is billions of elements or longer, for example, sampling a few thousand "balls" can be much more appealing than an O(n) approach that requires passing through every element in the list. – Alex Reynolds Oct 18 '10 at 10:05
  • How do you know the number of colors? – Thilo Oct 18 '10 at 10:29
  • If you know a priori what kinds of balls ("objects") go into the bucket ("list") then you can model appropriately. – Alex Reynolds Oct 18 '10 at 21:07
0
lst = [1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#lst = [1, 2, 1, 4, 1]
#lst = [1, 2, 1, 4]

length = len(lst)
currentValue = lst[0]
lst.pop(0)
currentCount = 1

for val in lst:
   if currentCount == 0:
      currentValue = val

   if val == currentValue:
      currentCount += 1
   else:
      currentCount -= 1

percent = (currentCount * 50.0 / length + 50)
epsilon = 0.1
if (percent - 50 > epsilon):
    print "Percent %g%%" % percent
else:
    print "No majority"

Note: epsilon has a "random" value; choose something depending on the length of the array, etc. Nikita Rybak's suggestion with currentCount >= n*0.95 won't work in general, because the final value of currentCount differs depending on the order of the elements, but the above does work.

C:\Temp>a.py
[2, 1, 1, 4, 1]
currentCount = 1

C:\Temp>a.py
[1, 2, 1, 4, 1]
currentCount = 2
Andrei Damian-Fekete
0

Sorting as a general solution is probably heavy, but consider the exceptionally well-balanced nature of Timsort in Python, which exploits any existing order in the list. I would suggest sorting the list (or a copy of it with sorted, though that copy will hurt performance). Then scan from the end and from the front until you find the same element, or until the scan length exceeds 5%; otherwise the list is 95% similar to the element found.

Taking random elements as candidates and counting them in decreasing order of frequency would probably also not be so bad, until some count exceeds 95% or the total of the counts goes over 5%.
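A sketch of the sort-based idea (assuming the elements are orderable; the bisect trick is an illustrative addition, not from the answer): after sorting, any element that fills 95% of the list must cover the middle position, and since equal elements form one contiguous run, the run's length can be measured without another full scan.

```python
import bisect

def ninety_five_same_sorted(lst):
    s = sorted(lst)  # Timsort exploits existing runs in the data
    if not s:
        return False
    # A 95% element must occupy the middle index of the sorted list.
    candidate = s[len(s) // 2]
    # Equal elements are contiguous after sorting; locate the run's bounds.
    lo = bisect.bisect_left(s, candidate)
    hi = bisect.bisect_right(s, candidate)
    return (hi - lo) >= 0.95 * len(s)
```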

Tony Veijalainen