I have a list of tuples where each tuple is structured like this: (Name, Age, City). I have, at most, about 30 tuples in my list.
There are no full duplicates, but sometimes two tuples share the same Name and Age while only the City differs.
Example input would be something like this:
lst = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]
I would like to remove these partial duplicates, where the subset to compare would be Name and Age but not City, and ideally keep the first occurrence. I guess an example would make it easier to understand.
Expected output:
expected_lst = [("Dave", 20, "Dublin"), ("Lisa", 20, "Monaco"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]
Dave's and Lisa's duplicates were removed, but not Frank's, since his two entries have different ages.
What I have tried so far:
I checked these posts:
- Python remove partial duplicates from a list
- Removing elements that have consecutive partial duplicates in Python
- Efficiently remove partial duplicates in a list of tuples
But they do not seem to match what I'm asking for, and I couldn't work out how to apply their solutions to my case.
I did find a solution that seems to work: convert my list to a pandas DataFrame and then drop duplicates using the drop_duplicates() function and its subset parameter:
import pandas as pd

df = pd.DataFrame(lst, columns=["Name", "Age", "City"]).drop_duplicates(subset=["Name", "Age"])
And then use itertuples() to convert it back to a list:
expected_lst = list(df.itertuples(index=False, name=None))
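As far as I can tell this gives exactly the expected output, since drop_duplicates() keeps the first occurrence by default:

print(expected_lst)
# [('Dave', 20, 'Dublin'), ('Lisa', 20, 'Monaco'), ('Frank', 56, 'Berlin'), ('Frank', 40, 'Berlin')]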
However, I do not need pandas for any of the other steps of my code, and changing the type of my data just for this seems a bit "much".
I was therefore wondering if there is a better way to get my expected output, one that would be quicker or shorter to write? I'm not an expert, but I assume that converting a list to a pandas DataFrame and then back to a list is not very efficient?
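To show the kind of thing I have in mind, here is a rough pure-Python sketch that tracks the (Name, Age) pairs already seen (the seen and deduped names are just placeholders I made up):

seen = set()
deduped = []
for name, age, city in lst:
    if (name, age) not in seen:  # first time this (Name, Age) pair appears
        seen.add((name, age))
        deduped.append((name, age, city))

This keeps the first occurrence of each pair in a single pass, but I don't know whether there is a shorter or more idiomatic way to write it.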