tidy up a list and remove duplicates

Question

Hello I want to tidy up this list to keep only the highest result. Example like: 'M', 'm' and 'O', 'o' are duplicate in the list I want to keep them as their greatest result.

Here is my list :

list_no_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:12 PM', '6.15'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:06 PM', '4.87'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]

and I want this as output:

list_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]

The problem is, there is a dash in a list and I don't know how to arrange the list while keeping it.

Thanks.

You posted what you want. What **did** you do to get there? How did it not work? [mre]? What is your problem? — Patrick Artner, Nov 28 '21 at 11:15
[removing-duplicates-from-a-list-of-lists](https://stackoverflow.com/questions/2213923/removing-duplicates-from-a-list-of-lists) — Patrick Artner, Nov 28 '21 at 11:17
[removing-duplicates-from-list-of-lists-in-python](https://stackoverflow.com/questions/1143379/removing-duplicates-from-list-of-lists-in-python) — Patrick Artner, Nov 28 '21 at 11:18
So the entries are already in order of increasing timestamps? And duplicates appear consecutively? — Kelly Bundy, Nov 28 '21 at 11:36

score 1 · Answer 1 · answered Nov 28 '21 at 11:19

I think the best way is to just loop over the list. Hopefully this works for you.

list_by_letters = {}

for element in list_no_ok:
    if element[0] in list_by_letters:
        element2 = list_by_letters[element[0]]

        if '-' in element2 or '-' in element:
            continue

        if element[-1] > element2[-1]:
            list_by_letters[element[0]] = element[1:]
            
        continue
    
    list_by_letters[element[0]] = element[1:]

list_ok = [[key]+list_by_letters[key] for key in list_by_letters]

Also, how do you want it to compare between elements that don't have values (the ones with dates) and ones that do? — Finlay, Nov 28 '21 at 11:23

scientist · Accepted Answer · 2021-11-28T12:04:57.767

The Python package pandas (https://pandas.pydata.org/) is a great way to do this, especially when your 2d list gets very large.

To start we would need to import the package:

import pandas as pd

We can put your 2d list in a pandas Dataframe and give it some sensible colums:

list_no_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:12 PM', '6.15'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:06 PM', '4.87'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]
df = pd.DataFrame(data=list_no_ok, columns=["col1", "col2", "status", "date", "value"])
df["value"] = pd.to_numeric(df["value"], errors='coerce')
df = df.sort_values(by=["value"], ascending=False)
df
    col1    col2    status       date                        value
3   K       k       Finished     1 October 2020 11:32 AM     10.00
5   M       m       Finished     1 October 2020 2:20 PM      9.10
0   A       a       Finished     30 September 2020 12:46 PM  8.08
6   N       n       Finished     1 October 2020 3:38 PM      7.69
4   M       m       Finished     1 October 2020 2:12 PM      6.15
8   O       o       Finished     1 October 2020 5:37 PM      5.90
7   O       o       Finished     1 October 2020 5:06 PM      4.87
1   B       b       Finished     30 September 2020 2:52 PM   3.46
2   J       j       In progress  -                           NaN

Here we have put the 2d list in a Dataframe, gave it column names, we converted the column value from string to numbers. We have added errors='coerce' for the rows that are still in progress. This will convert the '-' to NaN (Not a Number). Finally we have sorted it from high to low based on the value.

We have sorted it based value so that we can make use of the drop_duplicates function of pandas:

df = df.drop_duplicates(['col1','col2'],keep='first')
df = df.sort_values(by=['col1','col2'])
df
    col1    col2    status       date                        value
0   A       a       Finished     30 September 2020 12:46 PM  8.08
1   B       b       Finished     30 September 2020 2:52 PM   3.46
2   J       j       In progress  -                           NaN
3   K       k       Finished     1 October 2020 11:32 AM     10.00
5   M       m       Finished     1 October 2020 2:20 PM      9.10
6   N       n       Finished     1 October 2020 3:38 PM      7.69
8   O       o       Finished     1 October 2020 5:37 PM      5.90

Rows get dropped when they have identical col1 and col2. Whenever this happends, it keeps the first occurrence. In this case that is were value is the highest since we sorted on value before we dropped rows. We can go back to your list_ok by sorting again on col1 and col2. If you really want to go back to a 2d list that can be done with:

df = df.fillna('-')
list_ok = df.values.tolist()
list_ok
[['A', 'a', 'Finished', '30 September 2020  12:46 PM', 8.08],
 ['B', 'b', 'Finished', '30 September 2020  2:52 PM', 3.46],
 ['J', 'j', 'In progress', '-', '-'],
 ['K', 'k', 'Finished', '1 October 2020  11:32 AM', 10.0],
 ['M', 'm', 'Finished', '1 October 2020  2:20 PM', 9.1],
 ['N', 'n', 'Finished', '1 October 2020  3:38 PM', 7.69],
 ['O', 'o', 'Finished', '1 October 2020  5:37 PM', 5.9]]

We have used .fillna to convert the NaN back to the string '-'.

Thank you so much ! but it keeps the smallest value and i want the bigger one. — bendour, Nov 28 '21 at 11:50
Hi bendour, you are right. I made a silly mistake somewhere. Please re-read my answer, I've made some minor changes. — scientist, Nov 28 '21 at 12:05

tidy up a list and remove duplicates

2 Answers2