
Suppose I have a dataframe of the form

  X Y
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 3 7
7 3 8
8 3 9

I want my new DataFrame to be such that for every unique value of X, i.e. 1, 2, and 3, I isolate the row where Y is at its minimum. So I want my final DataFrame to look like:

  X Y
0 1 1
1 2 4
2 3 7

I am familiar with how to get each unique value of X. It would be df['X'].unique(). So my naive approach was initially to simply do:

import pandas as pd

X_list = []
for i in df['X'].unique():
    i_df = df.loc[df['X'] == i]
    X_list.append(i_df.loc[i_df['Y'].idxmin()])
new_df = pd.DataFrame(X_list)

However, this brings up problems because the indices get all messed up if you try to merge these isolated rows together by making a new DataFrame out of them. Namely, I get a KeyError saying a row with that index does not exist when I try to access these rows in the new DataFrame by their original index from the old DataFrame. I initially tried to solve this with the reset_index function, but that didn't work either.

So I would like to know if there is some clean way to query this from the original DataFrame or if there is an augmentation to my solution that makes it work like you would expect from a DataFrame without weird indexing problems. I suspect there is maybe some way to do this using the groupby function, but I can't really think of how.
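(For reference, the loop itself can be kept: the KeyError comes from the rows retaining their original labels, which `reset_index(drop=True)` discards. A minimal sketch, assuming reset_index was previously called without `drop=True`:)

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Y': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

X_list = []
for i in df['X'].unique():
    i_df = df.loc[df['X'] == i]
    X_list.append(i_df.loc[i_df['Y'].idxmin()])

# drop=True discards the old labels (0, 3, 6) and gives a clean 0..n-1 index
new_df = pd.DataFrame(X_list).reset_index(drop=True)
# new_df.loc[1] is now the X == 2 row, as expected
```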

Orren Ravid

3 Answers


Use GroupBy.min:

df.groupby('X')['Y'].min().reset_index()
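
Applied to the sample data, this produces exactly the desired frame (a quick sketch for illustration):

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Y': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# min() of Y within each X group; reset_index turns the group
# labels back into a regular column
out = df.groupby('X')['Y'].min().reset_index()
# out['Y'].tolist() → [1, 4, 7]
```

If the frame had additional columns and you needed the whole row rather than just the minimum, `df.loc[df.groupby('X')['Y'].idxmin()]` is a common variant.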
Erfan

You are correct in assuming groupby is an approach:

df.groupby(['X']).min()

You can replace min with other aggregation functions, such as count, max, or sum. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
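
Note that this form leaves X as the index (a small sketch to show the difference):

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Y': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

grouped = df.groupby(['X']).min()  # X becomes the index, not a column
# grouped['Y'].tolist() → [1, 4, 7]
```

Chain .reset_index() afterwards if you want X back as an ordinary column.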

AmourK

Here is something that might work. In your case, call filter_df(df, 'X', 'Y', find_max=False).

import pandas
import operator

def filter_df(df=None, group_col=None, col=None, find_max=True):
    """
    For each unique value of group_col, keep the row where col is
    at its maximum (or minimum, if find_max=False).  Returns
    another dataframe.  The grouped column may not be sorted.
    """
    if df is None or group_col is None or col is None:
        return None
    evaluate = operator.ge if find_max else operator.le
    d = {}
    for _, row in df.iterrows():
        key = row[group_col]
        if key not in d:
            d[key] = row
        elif evaluate(row[col], d[key][col]):
            d[key] = row
    return pandas.DataFrame(d.values()).reset_index(drop=True)
  • I do appreciate your approach, however I think the one liner suggested by @Erfan fits my purposes in a cleaner way. – Orren Ravid Nov 12 '19 at 17:20
  • Np, makes sense –  Nov 12 '19 at 17:28
  • Iterating over a dataframe is something you only want to do as a last resort. Vectorized solutions are the way to go. [Here's](https://stackoverflow.com/a/55557758/9081267) a good read on it – Erfan Nov 12 '19 at 18:28
  • @Erfan I suppose this is why `DataFrame` doesn’t have a parameterless pop method for rows... (edit: as far as I can tell) –  Nov 12 '19 at 18:48