
I was trying to eliminate duplicates from a list such that

  1. No second list is required.
  2. No modules are used.
  3. The order in which the values occur is preserved in the final list.

I could not find a method that satisfies all three conditions.

A little help?

My try: A) using the sort function, then a for loop to eliminate elements that were the same as the one at the previous index. This misses the order criterion.

B) using the count function; however, I had to make a copy of the list using list[:]
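Attempt A might have looked roughly like this (my reconstruction, not the asker's actual code): sorting puts duplicates next to each other so a single pass can drop them, but the original order is destroyed in the process.

```python
xs = [3, 1, 2, 3, 1]
xs.sort()                   # [1, 1, 2, 3, 3] - duplicates are now adjacent
i = 1
while i < len(xs):
    if xs[i] == xs[i - 1]:  # same as the previous index: drop it
        del xs[i]
    else:
        i += 1
print(xs)  # [1, 2, 3] - no duplicates, but the original order (3, 1, 2, ...) is gone
```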

  • What have you tried? What did you expect to work, but didn't? – Grismar Jan 13 '22 at 03:48
  • You could just check each item against all items earlier in the same list and delete it if you find a match. This won't be efficient in time, but it won't require a new list or set. – Matthias Fripp Jan 13 '22 at 03:52
  • There is an O(n^2) solution: for each element, just iterate over the rest of the list and check for equality. – Cresht Jan 13 '22 at 03:52
  • Matthias Fripp, yeah I tried that, but if you write a for loop like that you can run into problems, e.g. take the list 1,1,1,1: because an element is removed, the for loop skips certain iterations. See https://www.askpython.com/python/remove-duplicate-elements-from-list-python – Sage of Seven Paths Jan 13 '22 at 03:54
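The pitfall described in the last comment is easy to reproduce: removing elements from a list while iterating over it forward shifts the remaining elements left, so the loop skips the element that slides into the freed slot (a sketch of the failure mode, not anyone's posted code):

```python
xs = [1, 1, 1, 1]
for x in xs:          # iterating and mutating the same list
    if xs.count(x) > 1:
        xs.remove(x)  # shifts remaining elements left; the loop skips one
print(xs)  # [1, 1] - duplicates survive because iterations were skipped
```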

4 Answers


This solution seems to meet your criteria:

xs = [1, 2, 3, 3, 4, 5, 6, 4, 1]


while True:
    for i, x in enumerate(xs):
        try:
            del xs[xs.index(x, i+1)]
            break
        except ValueError:
            continue
    else:
        break


print(xs)

@cresht pointed out that the previous version created a new list via a slice. That was the case, but I've fixed it here, and the new solution is even briefer.

By the way, I prefer the look of the solution offered by @cresht, but it loops through the list twice explicitly and has to do another lookup in .pop(), so it'd be worth checking which is faster to pick the winner.

Edit: I tested, and mine is a bit faster:

from timeit import timeit


def dedupe(xs):
    while True:
        for i, x in enumerate(xs):
            try:
                del xs[xs.index(x, i+1)]
                break
            except ValueError:
                continue
        else:
            break


def remove_duplicates(arr: list) -> None:
    for i, x in enumerate(arr):
        for j, y in enumerate(arr):
            if x == y and i != j:
                arr.pop(j)


def test_dedupe():
    xs = [1, 2, 3, 3, 4, 5, 6, 4, 1]
    dedupe(xs)


def test_remove_duplicates():
    xs = [1, 2, 3, 3, 4, 5, 6, 4, 1]
    remove_duplicates(xs)


print(timeit(test_dedupe))
print(timeit(test_remove_duplicates))

Result:

1.7659466000004613
2.694205100000545

Both solutions meet the criteria, but this one is just a bit faster. (By default, timeit runs each a million times.)

Note that doing something like what @shmulvad proposes is of course the fastest and arguably the best solution, but I don't think it meets the criteria of the problem:

def test_simple():
    xs = [1, 2, 3, 3, 4, 5, 6, 4, 1]
    xs = list(dict.fromkeys(xs))


print(timeit(test_simple))

Result:

0.4907456999972055

Fast, but it creates a new copy of the original (a dict, and then a new list built from it).

Grismar
    Using slice creates a new list I believe – Cresht Jan 13 '22 at 04:00
    Good point, it might, fixed that. – Grismar Jan 13 '22 at 04:01
  • Is the implementation (`dedupe`) correct? If I run with the test sequence `[0, 2, 1, 0, 4, 2, 0, 0, 2, 3]`, I get an output of `[0, 2, 1, 4, 0, 2, 3]`. – Arnab De Jan 13 '22 at 06:27
  • Good catch @arnabde, a `break` got lost; adding it fixes the issue, and the answer should now work correctly for all cases. It doesn't affect the runtimes. The first `break` breaks out of the `for` after removing the first match, starting it over for the next; the second `break` breaks out of the `while` if no duplicates are found – Grismar Jan 13 '22 at 06:31
  • @Grismar, isn't the enumerate function making a dictionary as a copy? – Sage of Seven Paths Jan 13 '22 at 08:30
  • No, enumerate doesn't make a copy; it simply sets up a lazy iterator, which yields the index and the element one at a time. That's similar to having a for loop that loops over the indices and accesses each element: no full copy is made, and there is no data structure you can access other than the iterator. – Grismar Jan 13 '22 at 12:53
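A quick way to see that enumerate does not copy the list (a small illustration): mutate the list after creating the enumerate object, and the mutation is visible when the iterator is finally consumed.

```python
xs = [10, 20, 30]
en = enumerate(xs)   # enumerate is a lazy iterator over xs, not a copy
xs[1] = 99           # mutate the list after creating the enumerate object
pairs = list(en)     # values are read from xs only now
print(pairs)         # [(0, 10), (1, 99), (2, 30)] - the mutation shows up
```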

As of Python 3.7, the standard dict is guaranteed to preserve insertion order, so you can do something as simple as:

list(dict.fromkeys(your_lst))

For lower versions of Python, you can do something similar with collections.OrderedDict; however, this requires you to import a (built-in) module.
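A quick check of both variants (the plain dict form relies on Python 3.7+ ordering; the OrderedDict form works on older versions as well):

```python
from collections import OrderedDict  # only needed for the pre-3.7 variant

xs = [1, 2, 3, 3, 4, 5, 6, 4, 1]
deduped = list(dict.fromkeys(xs))
print(deduped)                          # [1, 2, 3, 4, 5, 6]
print(list(OrderedDict.fromkeys(xs)))  # same result for older Pythons
```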

If you really cannot use any modules and are below Python 3.7, then I think you have to write your own function. Unlike the other answers, this does not have quadratic runtime, but it does create a set. That is not explicitly against your requirements, but it might be a dealbreaker. You can consider it for a much better asymptotic runtime at the cost of some extra memory:

def ordered_remove_duplicates(lst):
    seen, idx = set(), 0
    while idx < len(lst):
        elm = lst[idx]
        if elm in seen:
            lst.pop(idx)
        else:
            seen.add(elm)
            idx += 1

    return lst
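For completeness, a quick usage check (the function from the answer is repeated so the snippet runs standalone; it modifies the list in place and also returns it):

```python
def ordered_remove_duplicates(lst):
    # same function as in the answer above, repeated for a runnable snippet
    seen, idx = set(), 0
    while idx < len(lst):
        elm = lst[idx]
        if elm in seen:
            lst.pop(idx)
        else:
            seen.add(elm)
            idx += 1
    return lst

xs = [1, 2, 3, 3, 4, 5, 6, 4, 1]
print(ordered_remove_duplicates(xs))  # [1, 2, 3, 4, 5, 6]
```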
shmulvad
    this creates a second list? – Cresht Jan 13 '22 at 03:53
    I suppose technically it creates a `set`, not a `list`, but I still think that's against the spirit of the question. Also, the first solution creates a new `dict`, which has a similar problem. (which is not to say that the first solution given here isn't exactly what you should be using if you needed it in production - it's way faster than the solution I provided) – Grismar Jan 13 '22 at 04:11
    @Grismar Good point, the different solutions in this answer do in some sense go against the spirit of the question. I've added some points about that. I'll still leave them here in case somebody comes across this question and needs it for production, but your answer is undoubtedly better for the specific question being asked. – shmulvad Jan 13 '22 at 04:20
    Agreed, I referenced your solution in mine, in case someone comes here and only reads the checked answer. – Grismar Jan 13 '22 at 04:24

This is the solution I described in my comment. Let me know if it does not work or if I messed up the code.

def remove_duplicates(arr: list) -> None:
    for i, x in enumerate(arr):
        j = i + 1
        while j < len(arr):
            if x == arr[j]:
                arr.pop(j)
            else:
                j += 1
Cresht
def dedup_1(xs):
    l = len(xs)
    for i in reversed(range(l)):
        b = xs.pop(i)
        if b not in xs:
            xs.insert(i, b)
        if i < 1:
            break
    return xs

Benchmark:

from timeit import timeit
import random

a = random.choices(range(5), k=20)
print(a)

def dedup_1(xs):
    l = len(xs)
    for i in reversed(range(l)):
        b = xs.pop(i)
        if b not in xs:
            xs.insert(i, b)
        if i < 1:
            break
    return xs

def dedup_2(xs):
    while True:
        for i, x in enumerate(xs):
            try:
                del xs[xs.index(x, i+1)]
                break
            except ValueError:
                continue
        else:
            break
    return xs


print(dedup_1(a[:]))
print(dedup_2(a[:]))

def test_dedup_1():
    dedup_1(a[:])


def test_dedup_2():
    dedup_2(a[:])


print(timeit(test_dedup_1))
print(timeit(test_dedup_2))

Output:

[1, 1, 4, 1, 1, 2, 3, 1, 1, 2, 2, 2, 0, 4, 1, 1, 1, 0, 4, 0]
[1, 4, 2, 3, 0]
[1, 4, 2, 3, 0]
2.8496891420218162
11.819849102001172

Remarks:

  1. This one seems simpler and often faster than @Grismar's. On my benchmark:
  • For k=20:
    • Above implementation: 1.0546425660140812
    • @Grismar's implementation: 1.6304690400138497
  • For k=200:
    • Above implementation: 23.068929050001316
    • @Grismar's implementation: 491.9963251199806
  2. It is not recommended to use a try/except block for control flow.
  3. This is a crude implementation that runs in O(n^2), in line with what @matthiasfripp originally suggested. It is very likely that some improvements / relaxations will be needed for practical purposes.
Arnab De