Remove duplicates in list of object with Python

Question

I've got a list of objects and I've got a db table full of records. My list of objects has a title attribute and I want to remove any objects with duplicate titles from the list (leaving the original).

Then I want to check whether my list of objects duplicates any records in the database and, if so, remove those items from the list before adding the rest to the database.

I have seen solutions for removing duplicates from a list like this: myList = list(set(myList)), but I'm not sure how to do that with a list of objects.

I need to maintain the order of my list of objects too. I was also thinking maybe I could use difflib to check for differences in the titles.

__leaving the original__: what does this mean? If, as you said, you want to __maintain order__ of the list, then the first occurrence of a duplicate object in the list will be the original, right? — mouad, Nov 12 '10 at 21:56
Yes, I just meant I want to remove all of the duplicates except for the original. @S.Lott, I did search a ton and I didn't find anything, that's why I came here. Can you cite an example that addresses this exact problem? I would be happy to see it. — imns, Nov 12 '10 at 22:25
http://stackoverflow.com/search?q=Remove+duplicates+in+list+of+object+with+Python. — S.Lott, Nov 15 '10 at 16:51

vonPetrushev · Accepted Answer · 2014-02-11T16:40:36.083

set(list_of_objects) will only remove the duplicates if Python knows what a duplicate is; that is, you need to define what makes an object unique.

In order to do that, you'll need to make the object hashable. You need to define both the __hash__ and __eq__ methods; the glossary entry describes the requirements:

http://docs.python.org/glossary.html#term-hashable

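Though, if you only need == comparisons, defining __eq__ alone is enough. Note that in Python 3 a class that defines __eq__ without __hash__ becomes unhashable, so for set() you need both.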

EDIT: How to implement the __eq__ method:

As I mentioned, you need to know what makes your object unique. Suppose we have a Book with attributes author_name and title whose combination is unique (so we can have many books authored by Stephen King, and many books named The Shining, but only one book named The Shining by Stephen King). Then the implementation is as follows:

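def __eq__(self, other):
    # Comparing against a non-Book defers to the other operand
    if not isinstance(other, Book):
        return NotImplemented
    return (self.author_name == other.author_name
            and self.title == other.title)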

Similarly, this is how I sometimes implement the __hash__ method:

def __hash__(self):
    return hash(('title', self.title,
                 'author_name', self.author_name))

You can check that if you create a list of 2 books with the same author and title, the book objects will ~~be the same (with is operator) and~~ be equal (with the == operator). Also, when set() is used, it will remove one of the books.

EDIT: This is an old answer of mine, but I only now noticed that it had an error, which is corrected with the strikethrough in the previous paragraph: objects with the same hash() won't give True when compared with is. Hashability does matter, however, if you intend to use the objects as elements of a set or as keys in a dictionary.
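
To make the pieces above concrete, here is a minimal sketch that puts both methods on a Book class. The constructor and the sample data are assumptions for illustration; only author_name and title come from the answer. It also shows dict.fromkeys() as one way to keep the list order, which set() does not do.

```python
class Book:
    """Hypothetical class; a book is unique by (author_name, title)."""

    def __init__(self, author_name, title):
        self.author_name = author_name
        self.title = title

    def __eq__(self, other):
        if not isinstance(other, Book):
            return NotImplemented
        return (self.author_name, self.title) == (other.author_name, other.title)

    def __hash__(self):
        # Must agree with __eq__: equal books produce equal hashes.
        return hash((self.author_name, self.title))


books = [
    Book("Stephen King", "The Shining"),
    Book("Stephen King", "It"),
    Book("Stephen King", "The Shining"),  # duplicate of the first
]

print(books[0] == books[2])  # True: equal by value
print(books[0] is books[2])  # False: still two distinct objects

# set() drops the duplicate but does not keep the list order.
print(len(set(books)))  # 2

# dict.fromkeys() keeps the first occurrence of each book, in list order
# (dicts preserve insertion order in Python 3.7+).
unique_books = list(dict.fromkeys(books))
print([b.title for b in unique_books])  # ['The Shining', 'It']
```

With dict.fromkeys() the object that survives is the first one encountered, since a later equal key does not replace the key already stored.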

Nice, I didn't know about `__hash__` and `__eq__`. Any examples on how to implement `__eq__`? — imns, Nov 14 '10 at 17:02
You need to make sure the class is the same or the fields won't be available, so `__eq__` also needs to do `self.__class__ == other.__class__ and self.author_name == other.author_name and self.title == other.title` — Mahesh, Jun 21 '19 at 08:13
Do we know which one of the "duplicates" is kept and which one is discarded? Following the book example, let's say they have a field publication_date (the same book can have multiple editions, hence multiple publication dates). If the list is initially ordered by most recent to oldest, and I remove duplicates using this technique (disregarding the publication_date when defining `__eq__`), do I know which one is kept and which one is discarded? — Marco Castanho, Dec 22 '21 at 16:59
The OP requested order be preserved, but `set` does not maintain order. — Noldorin, Sep 21 '22 at 13:52

score 27 · Answer 2 · answered Nov 12 '10 at 21:43

Since the objects aren't hashable, you can't use a set on them directly. The titles should be hashable, though.

Here's the first part.

seen_titles = set()
new_list = []
for obj in myList:
    if obj.title not in seen_titles:
        new_list.append(obj)
        seen_titles.add(obj.title)

You're going to need to describe what database/ORM etc. you're using for the second part though.
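
For the second part, one ORM-agnostic approach is to fetch the titles that already exist and seed seen_titles with them, so the same loop handles both kinds of duplicate. The sketch below uses the standard-library sqlite3 module with an in-memory table purely as a stand-in for the real database; the table name records, the Item class, and the sample rows are assumptions, not part of the question.

```python
import sqlite3


class Item:
    """Hypothetical stand-in for the objects in myList."""

    def __init__(self, title):
        self.title = title


# In-memory database standing in for the real one (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (title TEXT)")
conn.executemany("INSERT INTO records (title) VALUES (?)", [("b",), ("d",)])

myList = [Item("a"), Item("b"), Item("a"), Item("c")]

# Start from the titles already stored, so one pass removes both
# in-list duplicates and items that duplicate a database record.
seen_titles = {row[0] for row in conn.execute("SELECT title FROM records")}
new_list = []
for obj in myList:
    if obj.title not in seen_titles:
        new_list.append(obj)
        seen_titles.add(obj.title)

# Only the genuinely new items are written back.
conn.executemany(
    "INSERT INTO records (title) VALUES (?)",
    [(obj.title,) for obj in new_list],
)
conn.commit()

print([obj.title for obj in new_list])  # ['a', 'c']
```

This pulls every existing title in one query, which may be too much for a large table; in that case querying only for the titles present in the list is probably the better trade-off.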

aaronasterling

I'm using mysql with sqlobject. – imns Nov 12 '10 at 22:07
@bababa please update the question so that other people see it as well. – aaronasterling Nov 12 '10 at 22:14
@bababa, I don't see a good way to do this using sqlobject (i.e. without pulling every object from the DB in one query or making one query per object) so I'll wait a while and then post that if somebody that doesn't know sqlobject better than I do doesn't come along. – aaronasterling Nov 12 '10 at 23:49
Just out of curiosity, why did you use a set instead of a dict? Isn't dict key checking O(1) as well? – mahmoudafer Nov 07 '21 at 20:05

score 6 · Answer 3 · answered Nov 13 '10 at 02:32

This seems pretty minimal:

new_dict = dict()
for obj in myList:
    if obj.title not in new_dict:
        new_dict[obj.title] = obj
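
The loop above leaves the unique objects in a dict keyed by title. A small sketch of getting a list back, assuming Python 3.7+ (where dicts preserve insertion order) and a hypothetical Item type with a title attribute:

```python
from collections import namedtuple

# Hypothetical object type; only the title attribute matters here.
Item = namedtuple("Item", ["title", "tag"])

myList = [Item("a", 1), Item("b", 2), Item("a", 3)]

new_dict = dict()
for obj in myList:
    if obj.title not in new_dict:
        new_dict[obj.title] = obj

# Values come back in insertion order, i.e. first occurrences in list order.
new_list = list(new_dict.values())
print(new_list)  # [Item(title='a', tag=1), Item(title='b', tag=2)]
```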

hughdbrown

score 5 · Answer 4 · answered Nov 24 '21 at 22:45

If you can't (or won't) define __eq__ for the objects, you can use a dict-comprehension to achieve the same end:

unique = list({item.attribute:item for item in mylist}.values())

Note that this will contain the last instance of a given key, e.g. for mylist = [Item(attribute=1, tag='first'), Item(attribute=1, tag='second'), Item(attribute=2, tag='third')] you get [Item(attribute=1, tag='second'), Item(attribute=2, tag='third')]. You can get around this by iterating over mylist[::-1] (if the full list is available), though the result then no longer follows the original order.
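
If the first instance should win and the original order should be kept, one alternative to reversing the list is dict.setdefault(), which only stores a key the first time it is seen. A sketch, with Item as a hypothetical class mirroring the example above:

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical type mirroring the example above."""
    attribute: int
    tag: str


mylist = [
    Item(attribute=1, tag="first"),
    Item(attribute=1, tag="second"),
    Item(attribute=2, tag="third"),
]

first_seen = {}
for item in mylist:
    # setdefault leaves an existing entry untouched, so the first item wins.
    first_seen.setdefault(item.attribute, item)

unique = list(first_seen.values())
print([item.tag for item in unique])  # ['first', 'third']
```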

score 3 · Answer 5 · answered Dec 13 '22 at 02:54

For non-hashable types you can use a dictionary comprehension to remove duplicate objects based on a field shared by all the objects. This is particularly useful for Pydantic, whose models are not hashable by default:

{ row.title : row for row in rows }.values()

Note that this identifies duplicates solely by row.title and keeps the last matching object for each title. This means that if your rows may have the same title but different values in other attributes, this won't work.

e.g. [{"title": "test", "myval": 1}, {"title": "test", "myval": 2}] ==> [{"title": "test", "myval": 2}]

If you wanted to match against multiple fields in row, you could extend this further:

{ f"{row.title}\0{row.value}" : row for row in rows }.values()

The null character \0 is used as a separator between fields. This assumes that the null character isn't used in either row.title or row.value.
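
An alternative that avoids picking a separator character is to key the dict on a tuple of the fields, which works as long as each field value is itself hashable. A sketch with a hypothetical Row class:

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical row type with the two fields used above."""
    title: str
    value: int


rows = [Row("test", 1), Row("test", 2), Row("test", 1)]

# A tuple key compares field by field, so no separator is needed.
unique_rows = list({(row.title, row.value): row for row in rows}.values())
print(unique_rows)  # [Row(title='test', value=1), Row(title='test', value=2)]
```

As with the string key, the last object seen for each key is the one that is kept.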

I just noticed that this is pretty much the same answer as @Dave but adds a bit more detail, my apologies for the duplicate answer! — Thomas Anderson, Dec 13 '22 at 02:57

score 0 · Answer 6 · answered Nov 26 '18 at 20:54

Both __hash__ and __eq__ are needed for this.

__hash__ is needed to add an object to a set, since Python's sets are implemented as hash tables. By default, immutable built-in objects like numbers, strings, and tuples (of hashable elements) are hashable.

However, hash collisions (two distinct objects hashing to the same value) are inevitable, due to the pigeonhole principle. So, two objects cannot be distinguished only using their hash, and the user must specify their own __eq__ function. Thus, the actual hash function the user provides is not crucial, though it is best to try to avoid hash collisions for performance (see What's a correct and good way to implement __hash__()?).
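
A small sketch of that point: the hypothetical class below returns the same hash for every instance, so all objects collide, yet set() still separates them correctly because it falls back on __eq__. The constant hash is deliberately bad and only there for illustration.

```python
class Title:
    """Hypothetical class with a deliberately terrible hash."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if not isinstance(other, Title):
            return NotImplemented
        return self.text == other.text

    def __hash__(self):
        # Every instance collides; correctness now rests on __eq__ alone.
        return 0


titles = {Title("a"), Title("b"), Title("a")}
print(len(titles))  # 2
```

The result is still correct, but every lookup has to compare against the colliding entries, which is why a hash that spreads values out is preferable.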

score 0 · Answer 7 · answered Dec 17 '19 at 15:48

I recently ended up using the code below. Like other answers, it iterates over the list and records what it has seen, but instead of building a second, deduplicated list it removes the already-seen items from the original list in place.

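# Calling objList.remove() inside the loop would skip elements,
# so compact the list in place instead
seen = set()
write = 0  # index where the next kept item goes
for obj in objList:
    key = obj["key-property"]
    if key not in seen:
        seen.add(key)
        objList[write] = obj  # move kept items toward the front
        write += 1
del objList[write:]  # drop the leftover tail in one step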

binW

This only works if the objList contains objects that are comparable (i.e. implementing the __eq__ method). For more information see https://stackoverflow.com/a/11456817/290588 Creating a deduplicated list would work for objects that do not implement __eq__. – LietKynes Jun 03 '21 at 10:36

score -3 · Answer 8 · answered Nov 20 '16 at 06:55

If you want to preserve the original order, use this:

seen = {}
new_list = [seen.setdefault(x, x) for x in my_list if x not in seen]

If you don't care about ordering, use this:

new_list = list(set(my_list))

Amir

score -14 · Answer 9 · answered Mar 17 '11 at 12:09

It's quite easy, friends:

a = [5,6,7,32,32,32,32,32,32,32,32]
a = list(set(a))
print(a)

[5, 6, 7, 32] (the order may differ, since sets are unordered)

That's it! :)

Spiderman

Cannot do this on a list that contains objects. – Brad Bird Sep 21 '14 at 00:00
