
I'm trying to check whether duplicate values exist in an integer list in Python. The code below works, but I found that the execution time grows quickly as the size of the list increases. How can I improve the run time of the following logic?

def containsDuplicate(nums):
    if len(nums) < 2:
        return False
    cnt = 0
    flag = False
    length = len(nums)
    while cnt < length:
        p = cnt + 1
        while p < length:
            if nums[cnt] == nums[p]:
                flag = True
                break
            p += 1
        cnt += 1
    return flag
namit
not 0x12

3 Answers


You could use a set:

>>> lst = [1, 2, 3]
>>> len(set(lst)) != len(lst)
False
>>> lst = [1, 2, 2]
>>> len(set(lst)) != len(lst)
True

EDIT: a faster version, as pointed out in the comments:

>>> lst = [3] + list(range(10**4))
>>> seen = set()
>>> any(x in seen or seen.add(x) for x in lst)
True
>>> seen
{0, 1, 2, 3}
Vincent
  • Exactly what I would do. I would also add an explanation: since items in a hash set are unique and each insertion costs O(1) on average, this solution not only gives the required result, it also does so efficiently – SomethingSomething Jun 09 '15 at 07:41
  • Hi, thank you for the answer, I will give this a try. I wanted to solve it without using sets. – not 0x12 Jun 09 '15 at 07:45
  • Sorry, this has its merits, but it is linear in the total length of the sequence, whereas it could be linear in the length up to the first duplicate item. – Ami Tavory Jun 09 '15 at 07:47
  • @Vincent I consequently changed my downvote to an upvote. Great answer. – Ami Tavory Jun 09 '15 at 08:04

The reason a set is more efficient is that it forces the entire collection to have unique keys. Therefore, converting the collection to a set will remove any duplicates as a side effect. A dictionary has the same properties, but would be somewhat slower than a set with current Python implementations.

The main inefficiency in your attempt is that you rescan the tail of the list for every element you examine. A data structure with unique keys avoids this: instead of scanning, the program looks the key up directly and sees whether it is already present. See e.g. http://www.pythoncentral.io/hashing-strings-with-python/ for a somewhat more detailed discussion.
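
For illustration, a set-based rewrite of containsDuplicate might look like this sketch; it stops scanning as soon as the first duplicate is found:

def containsDuplicate(nums):
    seen = set()                 # values encountered so far
    for x in nums:
        if x in seen:            # average O(1) membership test
            return True          # stop at the first duplicate
        seen.add(x)
    return False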

If you don't want to use a set or a related data structure (a rather arbitrary constraint -- if that is really the case, you should explain why you have this additional requirement), an alternative is to sort the data and then discard adjacent duplicates. In the general case this is still faster than repeatedly traversing the tail of the list, since the longer the list, the more it costs to traverse it multiple times. See e.g. http://interactivepython.org/runestone/static/pythonds/AlgorithmAnalysis/BigONotation.html for a Pythonic treatment of efficiency analysis.
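
As a rough sketch of that approach (assuming the elements are comparable; the function name here is just for illustration), sort a copy of the list and compare neighbouring elements, since sorting makes duplicates adjacent:

def containsDuplicateSorted(nums):
    nums = sorted(nums)                  # O(n log n), works on a copy
    for i in range(1, len(nums)):
        if nums[i] == nums[i - 1]:       # equal neighbours mean a duplicate
            return True
    return False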

tripleee

Melt the dataframe, then count occurrences of each value. If any group in the grouped result has a count greater than 1, a duplicate exists.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 5, 6, 7, 8, 1]})
df.reset_index(inplace=True)
df = df.melt(id_vars='index')            # keep the index column; 'value' now holds the data
print(df)
grouped = df.groupby('value').count()    # rows per distinct value
print(grouped)
print((grouped['index'] > 1).any())      # True if any value occurs more than once
Golden Lion