1

I want to build a function that will return True if any two items in a list are the same.

For example, [1,7,3,7,4] should return True and ["one","ONE","One"] should return False.

I need help with which parts of python look for duplicates.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
holaprofesor
  • 271
  • 4
  • 15

3 Answers3

2

Loop over the values and use a set to track what you have already seen. As soon as you see a value again, return True:

def has_duplicates(lst):
    seen = set()
    for elem in lst:
        if elem in seen:
            return True
        seen.add(elem)
    return False

This is very efficient in that it short-circuits; it won't loop over the whole list if a duplicate has been detected early on.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • Wouldn't a list be faster? (`seen = []`) I did a quick `timeit` (`timeit.timeit("s.add(5)", "s = set()")` was `0.100...` and `timeit.timeit("s.append(5)", "s = []")` was `0.0911`) –  Feb 23 '15 at 16:30
  • Why not just do: `if len(set(lst)) != len(lst)` – Gillespie Feb 23 '15 at 16:30
  • @RPGillespie Because that has to iterate over the whole list. –  Feb 23 '15 at 16:31
  • @Reticality because `set(lst)` iterates over the whole list to produce the set. No, using a list for `seen` is going to be *slower*. Don't compare appending speed of sets and lists here, it is *membership testing* that'll kill you, as it is O(N) for lists vs. O(1) for sets. The larger the number of unique values the slower testing against the list is going to be. – Martijn Pieters Feb 23 '15 at 16:33
1

Martijn's answer is the best, but with a few exceptions, this is worth a try.

>>> chk = lambda x: len(l) != len(set(l)) # check the length after removing dupes. 
>>> l = [1,7,3,7,4] 
>>> chk(l)
True
>>> l = ["one","ONE","One"]
>>> chk(l)
False

Note - As Martijn mentions in a comment, this is a slower process.

Community
  • 1
  • 1
Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140
1

Using a collections.Counter dict:

from collections import Counter
def has_dupes(l):
    # if most repeated key count is > 1 we have at least one dupe
    return Counter(l).most_common(1)[0][1] > 1

Or use any:

def has_dupes(l):
    return any(v > 1 for v in Counter(l).values()) 
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
  • 1
    This is slower still than using `len(set(l))` as it requires a O(N) loop over all the values, followed by creating a heap to find the most common element, again O(N). Asymptotically the same as `len(set(l))` but with a higher fixed cost per iteration. – Martijn Pieters Feb 23 '15 at 16:52
  • 1
    @MartijnPieters, yes, but a linear solution, another example of how to do what the OP wants and an introduction to a Counter dict which is a pretty useful tool. If the question was what is the optimal way because I have a huge amount of data then that would be different but it's not. – Padraic Cunningham Feb 23 '15 at 16:54
  • 1
    I'm merely commenting on the performance aspects of the solution, because I do think that matters. Otherwise this whole question would just be a dupe of [Identify duplicate values in a list in Python](http://stackoverflow.com/q/11236006) – Martijn Pieters Feb 23 '15 at 16:57
  • 1
    @MartijnPieters, fair enough but I don't think in this case performance is a concern at all. `has_dupes([1] + list(range(1000000)))` takes `249ms` ` – Padraic Cunningham Feb 23 '15 at 17:00
  • 1
    If it is no concern we may as well dupe this. But there is a *generic* use here for future visitors. Some of those may well be concerned with performance. – Martijn Pieters Feb 23 '15 at 17:05
  • 1
    @MartijnPieters. I am not disputing the extra work in mine I just don't think the OP is overly concerned with performance, I also don't think the title does either. – Padraic Cunningham Feb 23 '15 at 17:08
  • 1
    But all the solutions here *echo the other post*. My point being that if you *only* need to know *if* there is a duplicate, then you can use a single loop and exit early. Otherwise just use any of the techniques in the dupe and turn that into a boolean. – Martijn Pieters Feb 23 '15 at 17:10
  • 1
    @MartijnPieters, well all bar one of the related questions are not exactly great answers. There are n log n and quadratic solutions and the accepted answer use items instead of most_common or just the values, would you recommend `if mylist.count(x) > 1`? – Padraic Cunningham Feb 23 '15 at 17:13
  • `mylist.count(x) > 1` is terrible. I didn't say the options in that post were all that great. But the accepted answer is correct if you wanted to know *which* elements were duplicated. – Martijn Pieters Feb 23 '15 at 17:17