3

I have a table with the following format:

data = {'City' : ['London', 'Paris', 'Paris','NY' 'London'], 'Distance' : [5, 1, 7, 2, 6]}
df = pd.DataFrame(data)
df

    City    Distance
0   London  5
1   Paris   1
2   Paris   7
3   NY      2
4   London  6

I want to create a table with all rows with a unique 'City', and whenever there are 2 or more rows with the same 'City' value, I would like it to return the one with the lowest 'Distance'. So in this case I wanted a table like this:

City    Distance
London  5
Paris   1
NY      2

I know I can use:

df.groupby('City')

But I don't know what to add to it to return the smallest 'Distance'.

Best, Rosa

user5576
  • 119
  • 2
  • 7

5 Answers5

9

You need DataFrameGroupBy.idxmin for indexes of minimal Distance per group and then select rows by loc:

df1 = df.loc[df.groupby('City', sort=False)['Distance'].idxmin()]
print (df1)
     City  Distance
0  London         5
1   Paris         1
3      NY         2

Detail:

print (df.groupby('City', sort=False)['Distance'].idxmin())
City
London    0
Paris     1
NY        3
Name: Distance, dtype: int64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I have a similar problem but I would like to find the minimum that is not equal to 0 and have two columns instead of 1. – Geosphere Mar 24 '18 at 19:19
  • @Geosphere - Not sure if understand, can you create new question with sample data, expected output and what you try? – jezrael Mar 24 '18 at 19:25
  • Yes. I have 3 columns in a dataframe: x, y, time. What I want to retrieve is the row that has the minimum time but the time should be greater than 0 because in that case the x, y have the same value. – Geosphere Mar 24 '18 at 19:46
  • I have tried using df.min() but kept getting a series instead and I want the row. I also tried df['time'].drop_duplicates().nsmallest(1) but I'm still getting the 0.0 as lowest value. – Geosphere Mar 24 '18 at 19:48
  • @Geosphere - Can you create new question? be free modify `df = pd.DataFrame({'City' : ['London', 'Paris', 'Paris','NY' 'London'], 'Distance' : [5, 1, 7, 2, 6]})` and add expected output - create answer from your comments :) – jezrael Mar 24 '18 at 19:49
  • Another question on Stackoverflow? – Geosphere Mar 24 '18 at 19:50
  • @Geosphere - Yes, try create new one :) – jezrael Mar 24 '18 at 19:51
  • https://stackoverflow.com/questions/49469300/pandas-dataframe-find-the-row-with-minimum-value-based-on-two-columns-but-grea – Geosphere Mar 24 '18 at 20:02
  • When I run this, I get a future warning: `Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative`. Is this an issue that we will have to worry about? – ifly6 May 01 '19 at 18:06
5

Sometime groupby is unnecessary, try drop_duplicates

df.sort_values('Distance').drop_duplicates('City')

Out[377]: 
     City  Distance
0  London         5
1   Paris         1
3      NY         2
IanS
  • 15,771
  • 9
  • 60
  • 84
BENY
  • 317,841
  • 20
  • 164
  • 234
  • 1
    It is my opinion that this approach should be quick. Also, it is unnecessary to include `'City'` in the sort. And if you leave it out, you don't need the other `sort_index`. `df.sort_values('City').drop_duplicates('City') ` – piRSquared Oct 05 '17 at 14:23
  • 1
    I like your solution. Reminds me of [this discussion](https://stackoverflow.com/a/46490720/5276797) we had :D – IanS Oct 05 '17 at 15:13
1

You can use

>>> df.groupby(['City'], sort=False)['Distance'].min()
City
London    5
Paris     1
NY        2
Name: Distance, dtype: int64
akash karothiya
  • 5,736
  • 1
  • 19
  • 29
  • 3
    Unfortunately, this doesn't solve the problem expressed in the OP. The question clearly asks to return the rows for which the minimum resides, for each group. This will always return a series with unique cities in the index and the corresponding minimum distances. Any other columns would be lost. – piRSquared Oct 05 '17 at 14:27
1

My opinion is that @jezrael offers the most idiomatic approach within a groupby. I've offered the same solution myself on other answers. However, here are some other alternatives.

Option 1
Use pd.DataFrame.nsmallest within an apply
This offers clean logic even if the api is a bit clumsy. I think this version of nsmallest should be available to the groupby object. But as of pandas 0.20.3, it is not. So we use it within the general purpose apply method. Make sure to use group_keys=False in the call to groupby in order to avoid awkward additional indices.

df.groupby('City', group_keys=False).apply(
    lambda d: d.nsmallest(1, columns='Distance'))

     City  Distance
0  London         5
3      NY         2
1   Paris         1

Option 2
Was taken by @Wen so I deleted.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

It´s an old question but for the sake of completion:

df.sort_values(by=['Distance'], ascending=True).groupby('City').first().reset_index(drop=False)

provides the same output.

Bennimi
  • 416
  • 5
  • 14