Better solution for removing duplicates from a Python list

Question

I have two functions to remove duplicates from a list:

def solution1(a: List):
    seen = set()
    result = []
    for item in a:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

def solution2(a: List):
    result = []
    for item in a:
        if item not in result:
            result.append(item)
    return result

I got solution1 from this SO question. It's time and space complexities are O(n).

I believe solution2 also has a space complexity of O(n). What is the time complexity? The for loop is O(n) but I'm not sure about the in statement because the length of result isn't as big as the input a. Is it also just O(n)?

Lastly, if both solutions have the same time and space complexities, is either considered "better"? Which should I use in an interview?

It looks like solution1 uses more space (maybe O(2n) but I know we're supposed to leave off the constant in front) and completes in shorter time (testing for set membership is faster than testing for list membership).

Consider the case where the input list is all unique. There's no question that solution2 will be O(n^2) in time for that case. — Mark Ransom, Sep 23 '20 at 03:48
btw There is no O(2n); you're mixing concepts. Average number of computations can be expressed as 2n, but but asymptotic analysis is expressed as just the variable: O(n), O(n^2), O(lg n). — kmiklas, Sep 23 '20 at 03:50
why dont you just convert the list into a set and back to a list. It will remove duplicates. — Joe Ferndz, Sep 23 '20 at 03:52
Even though `result` is smaller than `a`, lookups in `result` still grow linearly with `n`, therefore `solution2` as a whole is O(n^2). If you want the fastest solution and show that you understand how to leverage Python's strengths: `result = list(set(a))` is all you need (assuming you don't care about order) - otherwise, `solution1` is OK. — Grismar, Sep 23 '20 at 03:52
`O(2N)` *is* `O(N)`. In any case, your second solution is quadratic time, your first solution is linear time. Generally, the set-based solution is the way to go — juanpa.arrivillaga, Sep 23 '20 at 03:56

VietHTran · Accepted Answer · 2020-09-23T04:06:35.863

2

I agree that solution1 space complexity is O(2n) but it can be approximated to O(n) which is roughly the same as solution2 in terms of space.

However, in terms of time efficiency, solution1 should be better than solution2 due to the lookup time for set data structure (if item not in seen:) is roughly O(1) while lookup time for regular python list (if item not in result:) is O(n). Therefore, It would be better if you use solution1 in an interview

You can see here for more information on time complexity of Python's data structure https://wiki.python.org/moin/TimeComplexity

edited Sep 23 '20 at 04:06

answered Sep 23 '20 at 03:52

VietHTran

2,233
2
9
16

2

Saying something uses more space "O(2N)" vs "O(N)" fundamentally misuses big-oh notation, which by definition is concerned with asymptotic behavior that is "at most a positive constant multiple" of some function. IOW O(2N) *is* O(N) – juanpa.arrivillaga Sep 23 '20 at 04:02
@juanpa.arrivillaga You're right. I've updated my answer according to your suggestion – VietHTran Sep 23 '20 at 04:07

zmike · Answer 2 · 2020-09-23T04:01:18.443

1

The second answer is actually greater than O(n) - it's more like O(n^2). This is because the time complexity of if item not in result depends on the number of elements in the list result i.e. it's O(n). See this answer for an explanation: Complexity of *in* operator in Python.

This is why the first answer is better: it uses a set or a dictionary, whose time complexity is (at best) constant time.

edited Sep 23 '20 at 04:01

answered Sep 23 '20 at 03:53

zmike

1,090
10
24

2

As 'm' grows linearly with 'n', "O(nm)" is just a confusing way of saying O(n^2) here. – Grismar Sep 23 '20 at 03:55

Joe Ferndz · Answer 3 · 2020-09-23T04:17:13.537

1

using set to remove duplicates

Here's one way to do it:

dups = [10,20,30,40,50,10,20,50,60,80,90]
print (dups)
remove_dups = list(set(dups))
print(remove_dups)

The output will look like this:

Original list:

[10, 20, 30, 40, 50, 10, 20, 50, 60, 80, 90]

Dups removed list:

[40, 10, 80, 50, 20, 90, 60, 30]

using minimal look back to remove dups

I also think this may reduce the lookup and be faster than a full scan of the list while preserving the order.

[a for i,a in enumerate (x) if a not in x[:i]]

Output is as follows:

[10, 20, 30, 40, 50, 60, 80, 90]

edited Sep 23 '20 at 04:17

answered Sep 23 '20 at 03:54

Joe Ferndz

8,417
2
13
33

Your solution is not guaranteed to preserve order though - the OP's solution is. – Grismar Sep 23 '20 at 03:55
@Grismar depends on the version of Python. The latest couple will preserve insertion order of dictionaries and sets. – Mark Ransom Sep 23 '20 at 03:59
What are the space and time complexities here? Looks like `O(1)` space and `O(n^2)` time. – Johnny Metz Sep 23 '20 at 04:03
1

@MarkRansom no, `set` objects do not preserve order in any Python version – juanpa.arrivillaga Sep 23 '20 at 04:03
@JohnnyMetz no, it's O(N) space and O(N) time. It's essentially the same as your first solution but it doesn't maintain order – juanpa.arrivillaga Sep 23 '20 at 04:03
If the order needed to be preserved, this wouldn't work. But if not, I think this would be the best solution, correct? – Johnny Metz Sep 23 '20 at 04:10
What do you mean by "best"? – juanpa.arrivillaga Sep 23 '20 at 04:15
@JohnnyMetz, agree. If the order is not an issue, using set will be the best option. Also, I added another option where we look back only the items we have iterated thru. that way the lookup is faster. – Joe Ferndz Sep 23 '20 at 04:15
@juanpa.arrivillaga interesting. I thought the insertion order guarantee would hold for both dictionaries and sets, but evidently it's only for dictionaries. I wonder why the difference? – Mark Ransom Sep 23 '20 at 04:16
@MarkRansom because `set` objects are not implemented as `dict` objects with empty values. There's a *lot* going on with Python `dict` objects. [Here's](https://www.youtube.com/watch?v=p33CVV29OG8) a pretty good discussion on the current state of the dict, by one of its maintainers. – juanpa.arrivillaga Sep 23 '20 at 04:18
1

`[a for i,a in enumerate (x) if a not in x[:i]]` nah, `x[:i]` creates a copy, so you are losing any efficiency gains by slicing it. Botton line, if you want to do this efficiently, use a `set`, don't use membership testing in a list in a loop – juanpa.arrivillaga Sep 23 '20 at 04:20

Better solution for removing duplicates from a Python list

3 Answers3

using set to remove duplicates

using minimal look back to remove dups