
I want to ask about the difference in time complexity between the following two approaches:

  1. Without using a set, as in CODE 1
  2. Using a set, as in CODE 2

The function remove_duplicates() takes an unsorted list and removes the duplicate elements, keeping a single occurrence of each value.

CODE 1

def remove_duplicates(input_list):
    result = []
    k = -1      # to store the index of current element being checked
    for i in input_list:
        k += 1
        flag = False
        for j in range(0,k,+1):
            if(input_list[j] == i):
                flag = True
        if(flag==False):
            result.append(i)
        else:
            continue
    return result

CODE 2

def remove_duplicates(input_list):
    return list(set(input_list))
akshaynagpal

3 Answers


Simply use list(set(input_list))

Paul Lo
  • Which one will have better time complexity? Traversing the list as in my code, or using set? – akshaynagpal Dec 11 '14 at 18:00
  • Yours is O(n^2); I believe the time complexity of the internal implementation of Python's set() is something like O(n), with space complexity O(n) as well. – Paul Lo Dec 11 '14 at 18:13

Since you are asking for complexity, let’s take a look at your first solution:

The outer loop will run n times, with n being the length of the list. For the element at index k, you then iterate again over the first k elements. Counting all the inner-loop iterations gives 0 + 1 + 2 + … + (n−1) = n(n−1)/2, which is O(n²).
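
A hypothetical little counter (illustrative only, not part of either solution) makes that arithmetic concrete:

# Count the inner-loop iterations of CODE 1 for a list of length n.
# Only the length matters here, not the contents of the list.
def count_comparisons(n):
    comparisons = 0
    for k in range(n):       # outer loop visits indices 0 .. n-1
        comparisons += k     # the inner loop runs k times for index k
    return comparisons

for n in (10, 100, 1000):
    print(n, count_comparisons(n), n * (n - 1) // 2)   # both counts agree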

To analyze the set solution, we have to understand how it works. Creating a set from a list iterates the list exactly once. Every element found in the list is added to the set, and adding an element to a (hash-based) set takes constant time on average. So in total, this is O(n).
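
If you want to see that single pass spelled out, here is a minimal sketch of the same idea with an explicit set for the membership test (the name remove_duplicates_seen is purely illustrative); unlike list(set(input_list)), it also preserves the original order:

def remove_duplicates_seen(input_list):
    seen = set()
    result = []
    for item in input_list:
        if item not in seen:    # average O(1) membership test in a hash-based set
            seen.add(item)      # average O(1) insertion
            result.append(item)
    return result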

Another solution would be to sort the list first and then iterate over it once to find duplicates, which are now adjacent. That gives X + O(n), where X is the complexity of the sorting algorithm. Since no sort can run in less than O(n), X dominates the sum; with a standard comparison sort, X is O(n log n), so the full algorithm is O(n log n).
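
A rough sketch of that sort-first variant (again, just illustrative) could look like this; note that it changes the order of the elements and requires them to be comparable:

def remove_duplicates_sorted(input_list):
    result = []
    for item in sorted(input_list):            # O(n log n) comparison sort
        if not result or result[-1] != item:   # duplicates are now adjacent
            result.append(item)
    return result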

As for space complexity, the first and the third approach can be done in place (if you are allowed to modify the input list), so they need at most constant extra space. The set needs O(n) space in the worst case.

poke
  • How does creating a set from a list iterate the list only once? Might the internal mechanism of set creation follow the same procedure of traversing and searching for unique elements? – akshaynagpal Dec 11 '14 at 18:28
  • Search for [hash table](https://en.wikipedia.org/wiki/Hash_table) and [binary tree](https://en.wikipedia.org/wiki/Binary_tree), two data structures usually used for implementing sets. They allow membership tests in O(1) and O(log n) respectively. – 5gon12eder Dec 11 '14 at 18:32
  • A set is internally implemented as a hash table, as 5gon12eder said. See also [this page](https://wiki.python.org/moin/TimeComplexity) for more information about the complexity of various built-in data structures in Python. – poke Dec 11 '14 at 18:55

Questions about optimization and what is "fastest" involve a lot of variables, so you should always profile with your own data, under the same conditions you expect the code to run in. Consider the following code:

import timeit
import random
import functools

def remove_duplicates_manual(input_list):
    result = []
    k = -1      # to store the index of current element being checked
    for i in input_list:
        k += 1
        flag = False
        for j in range(0,k,+1):
            if(input_list[j] == i):
                flag = True
        if(flag==False):
            result.append(i)
        else:
            continue

    return result

def remove_duplicates_set(input_list):
    return list(set(input_list))

l = [random.randint(0, 10) for x in range(1000)]  # 1000 random integers between 0 and 10
rd_manual = functools.partial(remove_duplicates_manual, l)
rd_set = functools.partial(remove_duplicates_set, l)

print(timeit.timeit(rd_manual, number=100))
print(timeit.timeit(rd_set, number=100))

This code produces the following output:

3.648878
0.001779

So we can conclude that the set method is generally going to be faster for a list of random integers in the range 0–10, but the results may vary with your data and your system. Also, as mentioned in the comments and other answers, there are many ways you could tweak your manual algorithm to make it faster.

Additionally, you should note that the set method may not even be possible for some data. Python sets require hashable elements (just like dictionary keys), so if your list contains mutable data types (like other lists), you won't be able to use list(set(input_list)).
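
For example (purely illustrative, reusing the remove_duplicates_manual function defined above):

data = [[1, 2], [3, 4], [1, 2]]

try:
    print(list(set(data)))
except TypeError as error:
    print(error)                       # lists are unhashable, so set() raises TypeError

print(remove_duplicates_manual(data))  # [[1, 2], [3, 4]] still works

If converting the inner lists to tuples is acceptable for your data, list(set(map(tuple, data))) would work as well, though it loses the original order.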

skrrgwasme
  • I'm sorry to say that this does not answer the question. (Which, since it is the accepted answer, probably means that the OP wanted to ask something different.) Asymptotic time complexity is *not* the question of which program runs faster for a given input, but of how execution time *scales* with growing input size. Therefore, an experimental answer (as opposed to [poke](http://stackoverflow.com/a/27429745/1392132)'s very good theoretical answer) should plot execution times as a function of input size. You should see a parabola and a slightly more than linearly growing sequence. – 5gon12eder Dec 11 '14 at 18:45
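
As 5gon12eder's comment suggests, a more informative experiment varies the input size. A rough sketch, assuming remove_duplicates_manual and remove_duplicates_set from the answer above are in scope (exact numbers will depend on your machine):

import timeit
import random
import functools

for n in (500, 1000, 2000, 4000):
    data = [random.randint(0, 10) for _ in range(n)]
    manual = timeit.timeit(functools.partial(remove_duplicates_manual, data), number=10)
    with_set = timeit.timeit(functools.partial(remove_duplicates_set, data), number=10)
    print(n, manual, with_set)

Doubling n should roughly quadruple the manual timing (quadratic growth), while the set-based timing should grow roughly linearly.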