How to get distinct rows from pandas dataframe?

Question

I am having trouble with getting distinct values from my dataframe.. Below is the code i currently use, in line 25(3rd of vier()) is the issue: I would like to show the top 10 fastest drivers based on their average heat(go-kart heat) time.

Input:

HeatNumber,NumberOfKarts,KartNumber,DriverName,Laptime
334,11,5,Monique,00:53.862
334,11,5,Monique,00:59.070
334,11,5,Monique,00:47.832
334,11,5,Monique,00:47.213
334,11,5,Monique,00:51.975
334,11,5,Monique,00:46.423
334,11,5,Monique,00:49.539
334,11,5,Monique,00:49.935
334,11,5,Monique,00:45.267
334,11,12,Robert-Jan,00:55.606
334,11,12,Robert-Jan,00:52.249
334,11,12,Robert-Jan,00:50.965
334,11,12,Robert-Jan,00:53.878
334,11,12,Robert-Jan,00:48.802
334,11,12,Robert-Jan,00:48.766
334,11,12,Robert-Jan,00:46.003
334,11,12,Robert-Jan,00:46.257
334,11,12,Robert-Jan,00:47.334
334,11,20,Katja,00:56.222
334,11,20,Katja,01:01.005
334,11,20,Katja,00:50.296
334,11,20,Katja,00:48.004
334,11,20,Katja,00:51.203
334,11,20,Katja,00:47.672
334,11,20,Katja,00:50.243
334,11,20,Katja,00:50.453
334,11,20,Katja,01:06.192
334,11,13,Bensu,00:56.332
334,11,13,Bensu,00:54.550
334,11,13,Bensu,00:52.023
334,11,13,Bensu,00:52.518
334,11,13,Bensu,00:50.738
334,11,13,Bensu,00:50.359
334,11,13,Bensu,00:49.307
334,11,13,Bensu,00:49.595
334,11,13,Bensu,00:50.504
334,11,17,Marit,00:56.740
334,11,17,Marit,00:52.534
334,11,17,Marit,00:48.331
334,11,17,Marit,00:56.204
334,11,17,Marit,00:49.066
334,11,17,Marit,00:49.210
334,11,17,Marit,00:45.655
334,11,17,Marit,00:46.261
334,11,17,Marit,00:46.837
334,11,11,Niels,00:58.518
334,11,11,Niels,01:01.562
334,11,11,Niels,00:51.238
334,11,11,Niels,00:48.808

Code:

import pandas as pd
import matplotlib.pyplot as plt

#Data
df = pd.read_csv('dataset_kartanalyser.csv')
df = df.dropna(axis=0, how='any')
df = df.join(df['Laptime'].str.split(':', 1, expand=True).rename(columns={0:'M', 1:'S'}))
df['M'] = df['M'].astype(int)
df['S'] = df['S'].astype(float)
df['Laptime'] = (df['M'] * 60) + df['S']
df.drop(['M', 'S'], axis=1, inplace=True)

#Funties
def twee():
    print("Het totaal aantal karts = " + str(df['KartNumber'].nunique())) 
    print("Het aantal unique drivers = " + str(df['DriverName'].nunique()))
    print("Het totaal aantal heats = " + str(df['HeatNumber'].nunique())) 

def drie():
    print("De 10 snelste Drivers obv individuele tijd zijn: ")
    print((df.groupby('DriverName')['Laptime'].nsmallest(1)).nsmallest(10))

def vier():
    print('De 10 snelste Drivers obv snelste heat gemiddelde:')
    print((df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)).nsmallest(10))

print(df)
     HeatNumber  NumberOfKarts KartNumber DriverName  Laptime
0           334             11          5    Monique   53.862
1           334             11          5    Monique   59.070
2           334             11          5    Monique   47.832
3           334             11          5    Monique   47.213
4           334             11          5    Monique   51.975
...         ...            ...        ...        ...      ...
4053        437              2         20       luuk   39.678
4054        437              2         20       luuk   39.872
4055        437              2         20       luuk   39.454
4056        437              2         20       luuk   39.575
4057        437              2         20       luuk   39.648

Output:

DriverName   HeatNumber
giovanni     411           26.233
ryan         411           27.747
giovanni     408           27.938
papa         394           28.075
guus         406           28.998
Rob          427           29.371
Suus         427           29.416
Jan-jullius  394           29.428
Joep         427           29.934
Indy         423           29.991

The output i get i almost correct, expect that the driver "giovanni" occurs twice. I would like to only show the fastest avg heat time for each driver. Anyone who know how to do this?

Please look at [here](https://stackoverflow.com/questions/36106490/how-to-get-unique-values-from-multiple-columns-in-a-pandas-groupby) — Wtow, May 22 '22 at 12:33
Yes i tried using these methods but i cant get the syntax right and i get a load of errors.. — Mathijs, May 22 '22 at 12:55
You need to provide a **minimal** reproducible example input. We have no clue what is in your CSV file. — mozway, May 22 '22 at 13:03
Great tip haha, thank you for helping me out. Check edited post! :) — Mathijs, May 22 '22 at 14:09
may you give the input as csv file, json or as a text ? We can not load the data by this way. — Wtow, May 22 '22 at 14:49
Sorry i am kinda new to stackoverflow, i think i added it the correct way now. — Mathijs, May 22 '22 at 14:59
If you have an solution please create an answer to the question so i can rep you and it may help other people. :) @Wtow — Mathijs, May 22 '22 at 15:29
Because i would like to show the top 10 drivers who have the fastest average heat time. Top 10 could be of the same heat. — Mathijs, May 22 '22 at 19:32

gal peled · Answer 1 · 2022-05-22T13:03:54.887

1

ok so add drop_duplication on a column like this just need to add sort as well df.sort_values('B', ascending=True)
.drop_duplicates('A', keep='first')

(df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)sort_values('Laptime', ascending=True).drop_duplicates('DriverName', keep='first')).nsmallest(10))

edited May 22 '22 at 13:03

answered May 22 '22 at 12:34

gal peled

467
5
8

But i done need the max for heatnumber, i need the min for avg heat time(based on the laps within the heat). Max heatnumber would indeed give the desired output but that is only due to the fact that heat 411 was giovanni's fastest.. – Mathijs May 22 '22 at 12:49
I change the script to drop_duplication on driver_name next time please add a ways to reproduce data so I can test my code – gal peled May 22 '22 at 13:04
I add some more data, but when i try to run your code i get this error: ValueError: No axis named Laptime for object type Series. If i remove the sort_values part i get the following error: TypeError: drop_duplicates() got multiple values for argument 'keep' – Mathijs May 22 '22 at 14:06

score 0 · Accepted Answer · answered May 22 '22 at 16:12

0

You group the datas by Drivername and HeatNumber. See the HeatNumbers, one of them is 411 and another is 408. Because of that pandas understand they are exactly different. If you equals them, they will be one.

answered May 22 '22 at 16:12

Wtow

98
1
8

Yes but if i groupby DriverName only, it wont show me the avg of the heat. What do you mean by equals them? – Mathijs May 22 '22 at 16:15
Think like that, you have two product, id of one of them is 411 and another is 408, so they are different product but their name is same. If theye were same product, their ids would be same. If HeatNumber is not something like id, it may independed on DriverName, you should not use it, just use DriverName. If you don't want to select them by DriverName, provide a unique id for each group. I hope I could explain myself. Ask whatever you could not understand. – Wtow May 22 '22 at 16:23
I understand why they both appear. I just need to be able to remove/filter-out the same DriverNames. If i remove HeatNumber, i get the avg Laptime for each driver. I would like to see the top 10 fastest drivers based on avg heat time. – Mathijs May 22 '22 at 16:33

How to get distinct rows from pandas dataframe?

2 Answers2

Linked