
I'm trying to check whether duplicate values exist in an integer list in Python. The code below works, but I found that the execution time grows quickly as the size of the list increases. How can I improve the run time of the following logic?

def containsDuplicate(nums):
    if len(nums) < 2:
        return False
    cnt = 0
    flag = False
    length = len(nums)
    while cnt < length:
        p = cnt + 1
        while p < length:
            if nums[cnt] == nums[p]:
                flag = True
                break
            p += 1
        cnt += 1
    return flag
namit
not 0x12

3 Answers


You could use a set:

>>> lst = [1, 2, 3]
>>> len(set(lst)) != len(lst)
False
>>> lst = [1, 2, 2]
>>> len(set(lst)) != len(lst)
True

EDIT: a faster version, as pointed out in the comments:

>>> lst = [3] + list(range(10**4))
>>> seen = set()
>>> any(x in seen or seen.add(x) for x in lst)
True
>>> seen
{0, 1, 2, 3}
Vincent
  • Exactly what I would do. I would also add an explanation: since items in a hash set are unique and each insertion costs O(1) on average, this solution not only gives the required result, it also does so efficiently – SomethingSomething Jun 09 '15 at 07:41
  • Hi, thank you for the answer, I will give this a try. I wanted to solve it without using sets. – not 0x12 Jun 09 '15 at 07:45
  • Sorry, this has its merits, but it is linear in the total length of the sequence, whereas it could be linear in the length up to the first duplicate item. – Ami Tavory Jun 09 '15 at 07:47
  • @Vincent I consequently changed my downvote to an upvote. Great answer. – Ami Tavory Jun 09 '15 at 08:04

The reason a set is more efficient is that it forces the entire collection to have unique keys. Therefore, converting the collection to a set will remove any duplicates as a side effect. A dictionary has the same properties, but would be somewhat slower than a set with current Python implementations.

The main inefficiency in your attempt is that you rescan the tail of the list for every element you examine. A data structure with unique keys avoids this: instead of scanning, the program looks the key up directly and sees whether it is already present. See e.g. http://www.pythoncentral.io/hashing-strings-with-python/ for a somewhat more detailed discussion.
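
For illustration, a set-based rewrite of containsDuplicate might look like this sketch; it stops scanning as soon as the first duplicate is found:

def containsDuplicate(nums):
    seen = set()                 # values encountered so far
    for x in nums:
        if x in seen:            # average O(1) membership test
            return True          # stop at the first duplicate
        seen.add(x)
    return False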

If you don't want to use a set or a related data structure (a rather arbitrary constraint -- if that is really the case, you should explain why you have this additional requirement), an alternative is to sort the data and then discard adjacent duplicates. In the general case this is still faster than repeatedly traversing the tail of the list, since the longer the list, the more it costs to traverse it multiple times. See e.g. http://interactivepython.org/runestone/static/pythonds/AlgorithmAnalysis/BigONotation.html for a Pythonic treatment of efficiency analysis.
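
As a rough sketch of that approach (assuming the elements are comparable; the function name here is just for illustration), sort a copy of the list and compare neighbouring elements, since sorting makes duplicates adjacent:

def containsDuplicateSorted(nums):
    nums = sorted(nums)                  # O(n log n), works on a copy
    for i in range(1, len(nums)):
        if nums[i] == nums[i - 1]:       # equal neighbours mean a duplicate
            return True
    return False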

tripleee

Melt the dataframe, then count occurrences of each value. If any group in the grouped result has a count greater than 1, a duplicate exists.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 5, 6, 7, 8, 1]})
df.reset_index(inplace=True)
df = df.melt(id_vars='index')            # keep the index column; 'value' now holds the data
print(df)
grouped = df.groupby('value').count()    # rows per distinct value
print(grouped)
print((grouped['index'] > 1).any())      # True if any value occurs more than once
Golden Lion