2

I have seen a number of other related SO questions like this and this, but they do not seem to be exactly what I want. Suppose I have a dataframe like this:

import pandas as pd
df = pd.DataFrame(columns=['patient', 'parent csn', 'child csn', 'days'])
df.loc[0] = [0, 0, 10, 5]
df.loc[1] = [0, 0, 11, 3]
df.loc[2] = [0, 1, 12, 6]
df.loc[3] = [0, 1, 13, 4]
df.loc[4] = [1, 2, 20, 4]
df
Out[9]: 
  patient parent csn child csn days
0       0          0        10    5
1       0          0        11    3
2       0          1        12    6
3       0          1        13    4
4       1          2        20    4

Now what I want to do is something like this:

grp_df = df.groupby(['parent csn']).min()

The problem is that the result computes the min across all columns (that aren't parent csn), and that produces:

grp_df
            patient  child csn  days
parent csn                          
0                 0         10     3
1                 0         12     4
2                 1         20     4

You can see that for the first row, the days number and the child csn number are no longer on the same row, like they were before grouping. Here's the output I want:

grp_df
            patient  child csn  days
parent csn                          
0                 0         11     3
1                 0         13     4
2                 1         20     4

How can I get that? I have code that iterates through the dataframe, and I think it will work, but it is slow as all get-out, even with Cython. I feel like this should be obvious, but I am not finding it so.

I looked at this question as well, but putting the child csn in the groupby list will not work, because child csn varies as days.

This question seems more likely, but I'm not finding the solutions very intuitive.

This question also seems likely, but again, the answers aren't very intuitive, plus I do want only one row for each parent csn.

One other detail: the row containing the minimum days value might not be unique. In that case, I just want one row - I don't care which.

Many thanks for your time!

Adrian Keister
  • 842
  • 3
  • 15
  • 33

4 Answers4

6

You can do this by using .idxmin() instead of .min() to get the index (row identifier) where "days" is at it minimum for each group:

data creation:

import pandas as pd

data = [[0, 0, 10, 5],
        [0, 0, 11, 3],
        [0, 1, 12, 6],
        [0, 1, 13, 4],
        [1, 2, 20, 4]]
df = pd.DataFrame(data, columns=['patient', 'parent csn', 'child csn', 'days'])

print(df)
   patient  parent csn  child csn  days
0        0           0         10     5
1        0           0         11     3
2        0           1         12     6
3        0           1         13     4
4        1           2         20     4
day_minimum_row_indices = df.groupby("parent csn")["days"].idxmin()

print(day_minimum_row_indices)
parent csn
0    1
1    3
2    4
Name: days, dtype: int64

From this you can see that the group parent csn 0 had a minimum number of days at row 1. Looking back to our original dataframe, we can see that row 1 had days == 3 and is infact the location of the minimum days for parent csn == 0. Parent csn == 1 had a minimum days at row 3, so on and so forth.

We can use the row indices to subset back into our original dataframe:

new_df = df.loc[day_minimum_row_indices]

print(new_df)
   patient  parent csn  child csn  days
1        0           0         11     3
3        0           1         13     4
4        1           2         20     4

Edit (tldr):

df.loc[df.groupby("parent csn")["days"].idxmin()]
Cameron Riddell
  • 10,942
  • 9
  • 19
4

As your desired ouput, you need sort_values and groupby first

df_final = (df.sort_values(['parent csn', 'patient', 'days', 'parent csn'])
              .groupby('parent csn').first())

Out[813]:
            patient  child csn  days
parent csn
0                 0         11     3
1                 0         13     4
2                 1         20     4
Andy L.
  • 24,909
  • 4
  • 17
  • 29
1

You can filter by the dataframe for the rows you need using groupby to create the filter rather than just using .groupby:

s = df.groupby('parent csn')['days'].transform('min') == df['days']
df = df[s]
df

Out[1]: 
   patient  parent csn  child csn  days
1        0           0         11     3
3        0           1         13     4
4        1           2         20     4

For example, this is is how it would look like if I put s in my dataframe. Then you just filter for the True rows which are the ones where minimum days per group are equal to that row.

Out[2]: 
   patient  parent csn  child csn  days      s
0        0           0         10     5  False
1        0           0         11     3   True
2        0           1         12     6  False
3        0           1         13     4   True
4        1           2         20     4   True
David Erickson
  • 16,433
  • 2
  • 19
  • 35
1

For some reason I can't explain your dataframe has columns of type object. This solution only works with numerical columns

df.days = df.days.astype(int)
df.iloc[df.groupby('parent csn').days.idxmin()]

Out:

  patient parent csn child csn  days
1       0          0        11     3
3       0          1        13     4
4       1          2        20     4
Michael Szczesny
  • 4,911
  • 5
  • 15
  • 32
  • 2
    It's because the dataframe began empty. Pandas doesnt assume the datatype of columns with nothing in it so it leaves them as "object" dtypes (which is the most flexible). Then when you "fill" the columns in via `.loc` they retain their "object" dtype. On an unrelated note, you should also use `.loc` for your answer because `idxmin()` returns the corresponding index associated with the minimum- so this answer would fail if the index were ["a", "b", "c", "d", "e"] – Cameron Riddell Oct 21 '20 at 22:35
  • 1
    @AdrianKeister - Please consider to accept [@CameronRiddell's answer](https://stackoverflow.com/a/64472611/14277722) instead. It's the same idea but more informative for other readers and was posted before my solution. – Michael Szczesny Oct 21 '20 at 22:40
  • Very well, if you wish. I can see they're more-or-less the same. Your answer is nice because the code's all in one place, though. – Adrian Keister Oct 21 '20 at 22:42
  • @MichaelSzczesny I appreciate it! I've added a "tldr" to my answer. – Cameron Riddell Oct 21 '20 at 22:47