What is the shortest way to drop partial duplicates from a list of tuples in Python without using Pandas?

Question

I have a list of tuples where each tuple is structured like this : (Name, Age, City). I have, at most, about 30 tuples in my list.

There are no full duplicates. However, Name and Age are sometimes duplicated while City differs.

Example input would be something like this :

lst = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"), ("Frank", 56, "Berlin"),  ("Frank", 40, "Berlin")]

I would like to remove partial duplicates, where the subset would be Name and Age but not City. Ideally I'd like to keep the first duplicate. I guess an example would make it easier to understand :

Expected output :

expected_lst = [("Dave", 20, "Dublin"), ("Lisa", 20, "Monaco"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]

Dave's and Lisa's duplicates were removed, but not Frank since the Age does not match.

What I have tried so far :

I checked these posts :

But they do not seem to match what I'm asking for and I didn't manage to understand how to apply the solutions to my case.

I did find a solution that seems to work, which is to convert my list to a pandas DataFrame and then drop duplicates using the drop_duplicates() function and its subset parameter :

import pandas as pd
df = pd.DataFrame(lst, columns=["Name", "Age", "City"]).drop_duplicates(subset=["Name", "Age"])

And then using itertuples to convert it back to a list.

expected_lst = list(df.itertuples(index=False, name=None))

However, I do not need pandas for any of the other steps of my code. Changing the type of my data seems a bit "much".

I was therefore wondering if there was a better way to get my expected output, that would maybe either be quicker or shorter to write ? I'm not an expert but I assume that converting a list to a pandas DataFrame and then back to a list is not very efficient ?

How long are your lists? Pandas is implemented in C and can loop through long lists much faster than Python. But, for small lists the overheads aren't worth it. — MatBailie, Aug 10 '23 at 15:36
@MatBailie my list is about 10 to 30 tuples long, and each tuple has 3 elements. — perly, Aug 10 '23 at 15:40
Would need to reverse the list first, so the first tuple takes precedence: https://trinket.io/python/ea3376af6e — MatBailie, Aug 10 '23 at 15:56

Kenny Ostrom · Accepted Answer · 2023-08-10T16:05:15.033

2

You can use the tuple of the "unique" elements (name, age) as dict key, where the value is the full tuple. Thus the name+age is unique.

In order to keep the first entry, check whether (name, age) is already in temp before inserting it. Edit: or just iterate over the reversed list, as MatBailie suggested, so that earlier entries overwrite later ones.

data = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"), ("Frank", 56, "Berlin"),  ("Frank", 40, "Berlin")]

temp = {(name, age) : (name, age, city) for name, age, city in reversed(data)}
for unique_item in temp.values():
    print(unique_item)

('Frank', 40, 'Berlin')
('Frank', 56, 'Berlin')
('Lisa', 20, 'Monaco')
('Dave', 20, 'Dublin')
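
If the original order of the list matters as well, the "check before inserting" variant mentioned above can be sketched with dict.setdefault. This is only an illustration: the helper name drop_partial_duplicates and its key_len parameter are made up for this example, and it assumes Python 3.7+, where dicts keep insertion order.

```python
data = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"),
        ("Lisa", 20, "London"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]


def drop_partial_duplicates(rows, key_len=2):
    """Keep the first row seen for each distinct prefix of length key_len.

    Hypothetical helper, not part of the original answer.
    """
    seen = {}
    for row in rows:
        # setdefault stores the row only if the key is not present yet,
        # so later duplicates are ignored and the first one is kept.
        seen.setdefault(row[:key_len], row)
    # Dicts preserve insertion order (Python 3.7+), so the result
    # follows the order of first appearance in the input.
    return list(seen.values())


result = drop_partial_duplicates(data)
print(result)
# [('Dave', 20, 'Dublin'), ('Lisa', 20, 'Monaco'), ('Frank', 56, 'Berlin'), ('Frank', 40, 'Berlin')]
```

Compared with iterating over the reversed list, this keeps the same tuples but returns them in their original order.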

edited Aug 10 '23 at 16:05

answered Aug 10 '23 at 15:59

Kenny Ostrom


Thank you, it worked! Does that mean that, by default, Python browses a dictionary in reverse order? For my specific case the order did not matter as much, but thank you anyway as this might be useful for other people as well. – perly Aug 11 '23 at 09:07
Whenever you have a duplicate, that means the dictionary key is the same, therefore you will overwrite the previous duplicate value. By going in reverse, you just overwrite the later values with the earliest. My original implementation was to check for a duplicate and discard the later duplicates, rather than saving them. – Kenny Ostrom Aug 11 '23 at 13:24
Oh my bad I was confused and thought this had to do with the order of my values, now I understand what you mean, reversing the list ensures that I only keep the first duplicate. Thank you this was exactly what I needed. – perly Aug 11 '23 at 20:12

Swifty · Answer 2 · 2023-08-10T16:08:49.513

0

You could make use of itertools.groupby, using the first 2 elements of your tuples as a key (you first need to sort the data, since groupby operates on consecutive entries):

from itertools import groupby
filtered_data = [next(g) for k,g in groupby(sorted(data), key=lambda tup: tup[:2])]

# [('Dave', 20, 'Dublin'), ('Frank', 40, 'Berlin'), ('Frank', 56, 'Berlin'), ('Lisa', 20, 'London')]

Of course, this only works if the initial order of tuples doesn't matter to you. Otherwise, @KennyOstrom's answer preserves the original order.
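
One detail worth noting: sorted(data) sorts on the whole tuple, so within a (name, age) group the kept entry is the one with the alphabetically smallest city (Lisa's "London" above), not necessarily the first one in the input. A possible variation, offered as a sketch rather than as part of the original answer, is to sort on the same two-element key. Python's sort is stable, so rows with equal keys keep their input order and next() then returns the first original occurrence. The names key and first_per_key are chosen for this example.

```python
from itertools import groupby

data = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"),
        ("Lisa", 20, "London"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]


def key(tup):
    # Group on (name, age) only; the city is ignored.
    return tup[:2]


# sorted() is stable: rows with the same (name, age) stay in input order,
# so the first element of each group is the first occurrence in data.
first_per_key = [next(group) for _, group in groupby(sorted(data, key=key), key=key)]
print(first_per_key)
# [('Dave', 20, 'Dublin'), ('Frank', 40, 'Berlin'), ('Frank', 56, 'Berlin'), ('Lisa', 20, 'Monaco')]
```

The result is still ordered by (name, age) rather than by position in the input.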

edited Aug 10 '23 at 16:08

answered Aug 10 '23 at 16:01

Swifty


Thank you so much, it also worked ! Since the order of tuples does not matter to me this is also a very good solution. I had never used itertools.groupby before so I'll definitely keep this in mind. – perly Aug 11 '23 at 09:08
You're welcome; and I'd advise you to study the whole (it's of modest size) `itertools` module, it's very nice. – Swifty Aug 11 '23 at 09:10
Thank you, I accepted Kenny's answer as it was a little shorter and I thought it could be useful for people who might need to keep the order but your answer was also really helpful and I'll definitely check the itertools module as it is not the first time I answer one of my questions with this library :) – perly Aug 11 '23 at 09:12
