
Let's say I have a list:

l = ['a', 'b', 'c', 'd', 'e', 'f', 'e']

As you can see, indexes 4 and 6 are repeated. My question is: What is the most efficient way to see if the list has something repeated in it?

option 1:

output = len(set(l)) != len(l)

If output is true, then there is a value in there more than once.

option 2:

output = True
for i in l:
  if l.count(i) > 1:
    output = False

If output is false, then there is a value in there more than once.

Questions:

  1. What is the most efficient way of doing this?

  2. How do I calculate the O notation of these two (or more?) options?

Thanks!

AbrahamB
  • It's not a duplicate, as that question asks "What is the best way (best as in the conventional way)" and the selected answer states "Not the most efficient, but straight forward and concise" - this question is for the most efficient way to do this. @Mitch – AbrahamB Jun 21 '16 at 09:43
  • 1. for typical data types, option 1 is optimal up to a constant factor; 2. looks like your home work, please show your own considerations – jmster Jun 21 '16 at 09:44
  • 1 looks like O(n) to me, while 2 like O(n^2) – TomekK Jun 21 '16 at 09:45
  • I've been out of school for 5 years, this is simply for my own edification :). Thanks for making me feel young again, though. @jmster – AbrahamB Jun 21 '16 at 09:46
  • Efficiency is going to depend on the likely size of your actual data, and likelihood of repetition. In a vacuum I would always go for option 1 as it's likely to be fast and it's also short. – RemcoGerlich Jun 21 '16 at 09:46
  • And look at your current 5 answers to the question - not a single answer has justified that theirs is ___the fastest___, and they can nearly all be found in the linked dup. Your best bet is to benchmark them yourself for whatever size list you are using and draw your own conclusions. – miradulo Jun 21 '16 at 09:50
  • @mitch, that is correct. In the 5 minutes since my question was posted, none of the posted answers provide the solution to the questions asked. The self-benchmarking is a good option. I will do this and report back. – AbrahamB Jun 21 '16 at 09:53
  • Closing it as a dupe of another post. As @Mitch said, if you want the best then you certainly need to benchmark. Otherwise the question becomes an opinionated one, where each answerer will try to argue that their answer is the best. – Bhargav Rao Jun 21 '16 at 15:05

5 Answers


About calculating the O() value:

Option 1 does 4 things: create a set, get its length, get the length of the list, and then compare them. Of these, creating the set must be at least O(n) and the others are at most that, so the efficiency is dominated by the set creation. I believe the implementation of sets in Python is such that insertion takes O(1) on average, and thus this should be O(n).

Option 2 contains a loop. Inside the loop, you call l.count, which iterates over the entire list to count the number of times an item occurs, so each iteration is O(n). The loop itself runs once for every item in the list, so n times. The total efficiency is O(n*n), i.e. O(n^2).

Whether something faster than option 1 exists depends on the characteristics of your real data: its length, the likelihood of a duplication, the number of different items (if they are all lower case letters, then any list with length > 26 has a duplication, and that's really fast to check) and so on. It can't be answered in general. But O(n) is really hard to beat: if duplications are rare then usually all items will have to be checked, and that is necessarily O(n) already.
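The lower-case-letters observation can be sketched as a pigeonhole pre-check (a hypothetical helper, not from the answer; the `alphabet_size` shortcut is only valid if the items really are drawn from that alphabet):

```python
def has_duplicates(items, alphabet_size=None):
    # Pigeonhole shortcut: if the items are known to come from a fixed
    # alphabet and there are more items than possible distinct values,
    # a duplicate is guaranteed without inspecting the data.
    if alphabet_size is not None and len(items) > alphabet_size:
        return True
    # Otherwise fall back to the O(n) set comparison (option 1).
    return len(set(items)) != len(items)

print(has_duplicates('abcdefghijklmnopqrstuvwxyza', alphabet_size=26))  # True
```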

RemcoGerlich
  • Well, on average you'll find the duplicate, in a list with only one duplicate, after trying `n/2` items. If the list is a single value repeated, you'll find the duplicate at index 2. So O(n) isn't necessarily that hard to beat ;-) Otoh, there is also a distinct difference between O(n) python-C-ops and O(n) Python ops... – thebjorn Jun 21 '16 at 10:06

Loop and collect seen items in a set.

Note the break as soon as the first duplicate is found. In the extreme case (no duplicates) you will loop over the list once and build a set containing every list item.

l = ['a', 'b', 'c', 'd', 'e', 'f', 'e']
seen = set()

for x in l:
    if x in seen:
        print("seen '{}' already, done".format(x))
        # As soon as we find the first duplicate, break.
        break
    seen.add(x)

Output:

seen 'e' already, done
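The same loop can be wrapped in a function that returns a boolean, which is what the question actually asks for (a sketch building on the answer, not part of the original):

```python
def has_duplicates(items):
    """Return True as soon as the first repeated item is seen."""
    seen = set()
    for x in items:
        if x in seen:
            return True   # early exit: no need to scan the rest
        seen.add(x)
    return False

print(has_duplicates(['a', 'b', 'c', 'd', 'e', 'f', 'e']))  # True
```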
  • Instead of a dict with all values True, why not a set? – RemcoGerlich Jun 21 '16 at 09:47
  • @RemcoGerlich Done that. –  Jun 21 '16 at 09:48
  • This would be a good way to find out which value is duplicated, but I don't need that. I simply would like to learn if any value is duplicated, and what is most efficient to accomplish this task. – AbrahamB Jun 21 '16 at 09:51
  • @AbrahamB: that you know which value is duplicated is just a side effect. The thing with this implementation is that it breaks the loop when a duplicate is found. If you have a list of a million items with the duplication at the start, this is going to be really really efficient compared to other solutions. But if there is a million item list with no duplication, it's going to be slow because it's a loop written in Python. You need to benchmark with _real data_ to decide. – RemcoGerlich Jun 21 '16 at 09:53
  • @AbrahamB Yes, note the `break` as soon as the first duplicate is found. In the extreme case (no duplicates) you will loop over the list once and build a set containing every list item. –  Jun 21 '16 at 09:56
  • @LutzHorn, ah, yes. That's true. I bet this would be faster than set(l) because you don't need to move every value into the set. – AbrahamB Jun 21 '16 at 09:59
  • @AbrahamB creating a set from a list is _blazingly_ fast compared to a for loop... – thebjorn Jun 21 '16 at 10:09

Option 1 is fast.

The set constructor uses hashing, so building the set is O(n) on average, and len() takes O(1) time.

Hence it is about as fast as anyone could do it.

https://wiki.python.org/moin/TimeComplexity

Shubham Sharma
  • It still needs to build the entire set even if the first two elements of a million item list are repeated. Saying that it's the fastest anyone could do is too much. – RemcoGerlich Jun 21 '16 at 09:48
  • In worst case, early exit won't work @Mitch – Shubham Sharma Jun 21 '16 at 09:53
  • In some other language this might not be true, but for Python this is definitely the correct answer. Just benchmark it to verify. – thebjorn Jun 21 '16 at 10:02

Here is the simplest speed comparison I could come up with.

Option 1 essentially boils down to creating a set from a list:

(dev) go|c:\srv\tmp> python -m timeit "set(range(100))"
100000 loops, best of 3: 5.48 usec per loop

while option 2 boils down to iterating over the list and doing a membership test (and a set add that I'm skipping):

(dev) go|c:\srv\tmp> python -m timeit "for i in range(100): i in set()"
10000 loops, best of 3: 22.1 usec per loop

(dev) go|c:\srv\tmp> python -m timeit "for i in range(50): i in set()"
100000 loops, best of 3: 10 usec per loop

(dev) go|c:\srv\tmp> python -m timeit "for i in range(25): i in set()"
100000 loops, best of 3: 5.2 usec per loop

So the first duplicate will have to be at most 25% into the list for option 2 to be faster, and adding the second set operation (the `add` that was skipped) will make the comparison even worse for option 2.
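For a more direct comparison, the two complete implementations can also be timed with the `timeit` module (a sketch; absolute numbers will vary by machine, so only the ratio is meaningful):

```python
import timeit

l = list(range(100))  # worst case: no duplicates, so both must scan everything

def option1():
    # Build the whole set and compare lengths.
    return len(set(l)) != len(l)

def option2():
    # Loop with a seen-set and an early exit.
    seen = set()
    for x in l:
        if x in seen:
            return True
        seen.add(x)
    return False

for fn in (option1, option2):
    print(fn.__name__, timeit.timeit(fn, number=100_000))
```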

thebjorn

A solution which gives you all the information in a nice dictionary format would be the following:

from collections import Counter

l = ['a', 'b', 'c', 'd', 'e', 'f', 'e']

# Get how many of each exist.
counts = Counter(l)
# Looks like this:
# Counter({'e': 2, 'd': 1, 'c': 1, 'f': 1, 'a': 1, 'b': 1})

# Print letters which have 2 or more instances.
for i in counts:
    if counts[i] > 1:
        print(i, counts[i])

Once you've got this, accessing individual dictionary elements can be done in constant time, though printing every duplicate still takes O(n) time. This, plus the initial O(n) cost of building the Counter, gives O(2n), which simplifies to O(n).
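If you only need a yes/no answer, the same Counter can feed a one-liner (a sketch building on the answer above, not part of the original):

```python
from collections import Counter

l = ['a', 'b', 'c', 'd', 'e', 'f', 'e']
counts = Counter(l)

# Any count above 1 means the list has a duplicate.
print(any(n > 1 for n in counts.values()))            # True
# Or collect just the duplicated items.
print([item for item, n in counts.items() if n > 1])  # ['e']
```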

Daniel Porteous