0

Full disclosure - I am a newbie so please be patient with me. I have a data file. I need to sort by the zip_code column first - then I need to calculate the highest score per zip code.

Fname  Lname  Area  Score
Amy    Doe    3    245
Jon    Doe    1    310
Jane   Doe    2    724
Brian  Doe    1    840
Gary   Doe    3    632
Jen    Doe    2   854
Jim    Doe    3   132
Rick   Doe    1   445

My code:

 import pandas as pd
 from pandas import DataFrame, pandas as pd

 file = pd.read_csv('test.dat',delimiter=',' )
 df = DataFrame(file, columns=['Fname','Lname','Score','zip_code'])
 df.sort_values(by=['Area','Score'], inplace=True)
 print(df)

The desired output would be something to this effect:

Fname  Lname  Area  Score
Brian  Doe    1   840-->Winner!
Rick   Doe    1   445
Jon    Doe    1   310
Jen    Doe    2   854-->Winner!
Jane   Doe    2   132
Gary   Doe    3   632-->Winner!
Jim    Doe    3   132
Rick   Doe    3   445

This is what I get it:

    Fname  Lname  Score  Area
0     NaN    NaN    NaN   NaN
1     NaN    NaN    NaN   NaN
2     NaN    NaN    NaN   NaN
3     NaN    NaN    NaN   NaN

I have not figured out how to sum up the column yet. Can you please tell me what I am doing wrong?

  • lookup `.groupby` - groupby zip, sort groups and take max one - you do no summing at all .... – Patrick Artner Jun 17 '19 at 18:59
  • I would sincerely advise you look into your datatypes before proceeding, and perhaps inspecting your csv. It looks like you have a lot of blank values. Your score & area may also be objects as opposed to integers or floats. – Umar.H Jun 17 '19 at 19:32

1 Answers1

0

Try groupby().idxmax():

df.loc[df.groupby('zip_code').Score.idxmax()]

Output:

   First   Last  Score  zip_code
0    Amy  Smith     56     32003
1  Brian  Smith     90     32025
6  Kelly  Jones     20     32080
2    Joe    Doe     90     32084
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74