I have seen a number of other related SO questions like this and this, but they do not seem to be exactly what I want. Suppose I have a dataframe like this:
import pandas as pd
df = pd.DataFrame(columns=['patient', 'parent csn', 'child csn', 'days'])
df.loc[0] = [0, 0, 10, 5]
df.loc[1] = [0, 0, 11, 3]
df.loc[2] = [0, 1, 12, 6]
df.loc[3] = [0, 1, 13, 4]
df.loc[4] = [1, 2, 20, 4]
df
Out[9]:
   patient  parent csn  child csn  days
0        0           0         10     5
1        0           0         11     3
2        0           1         12     6
3        0           1         13     4
4        1           2         20     4
Now what I want to do is something like this:
grp_df = df.groupby(['parent csn']).min()
The problem is that the result computes the min of each column independently (every column other than `parent csn`), and that produces:
grp_df
            patient  child csn  days
parent csn
0                 0         10     3
1                 0         12     4
2                 1         20     4
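Here is a quick way to confirm the problem with this result. It is only a minimal sketch: it rebuilds the same data with a dict constructor for brevity, and `original_pairs` and `pair_is_real` are names I made up for the check.

```python
import pandas as pd

# Same data as above, built from a dict for brevity.
df = pd.DataFrame({
    'patient':    [0, 0, 0, 0, 1],
    'parent csn': [0, 0, 1, 1, 2],
    'child csn':  [10, 11, 12, 13, 20],
    'days':       [5, 3, 6, 4, 4],
})

grp_df = df.groupby(['parent csn']).min()

# The (child csn, days) pairs that actually occur together in an original row.
original_pairs = set(zip(df['child csn'].tolist(), df['days'].tolist()))

# For each group, check whether the pair reported by min() is a real row.
pair_is_real = {
    int(parent): (int(row['child csn']), int(row['days'])) in original_pairs
    for parent, row in grp_df.iterrows()
}
print(pair_is_real)
```

With this data, only the single-row group (`parent csn` 2) keeps a pair that exists in the original frame; the other two groups mix values from different rows.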
You can see that for the first row, the `days` number and the `child csn` number are no longer on the same row, like they were before grouping. Here's the output I want:
grp_df
            patient  child csn  days
parent csn
0                 0         11     3
1                 0         13     4
2                 1         20     4
How can I get that? I have code that iterates through the dataframe, and I think it will work, but it is slow as all get-out, even with Cython. I feel like this should be obvious, but I am not finding it so.
I looked at this question as well, but putting `child csn` in the groupby list will not work, because `child csn` varies along with `days`.
This question seems more likely, but I'm not finding the solutions very intuitive.
This question also seems likely, but again, the answers aren't very intuitive, plus I do want only one row for each `parent csn`.
One other detail: the row containing the minimum `days` value might not be unique. In that case, I just want one row - I don't care which.
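To make the tie case concrete, here is a hypothetical variation of the data (the row with `child csn` 14 is invented purely for illustration) together with a small check that any candidate result should pass. `is_acceptable` is a helper name I made up; it is a plain loop meant as a specification of what I want, not as a fast solution.

```python
import pandas as pd

# Hypothetical data: rows 13 and 14 tie for the minimum days in parent csn 1.
df = pd.DataFrame({
    'patient':    [0, 0, 0, 0, 0, 1],
    'parent csn': [0, 0, 1, 1, 1, 2],
    'child csn':  [10, 11, 12, 13, 14, 20],
    'days':       [5, 3, 6, 4, 4, 4],
})

def is_acceptable(result, source):
    """Check a candidate result indexed by 'parent csn'."""
    # Exactly one row per parent csn.
    wanted = sorted(source['parent csn'].unique().tolist())
    if sorted(result.index.tolist()) != wanted:
        return False
    cols = ['parent csn', 'patient', 'child csn', 'days']
    source_rows = set(source[cols].itertuples(index=False, name=None))
    for parent, row in result.iterrows():
        candidate = (int(parent), int(row['patient']),
                     int(row['child csn']), int(row['days']))
        # The row must exist verbatim in the source frame.
        if candidate not in source_rows:
            return False
        # Its days value must be the minimum for that parent csn.
        group_min = source.loc[source['parent csn'] == parent, 'days'].min()
        if row['days'] != group_min:
            return False
    return True

idx = pd.Index([0, 1, 2], name='parent csn')
with_13 = pd.DataFrame({'patient': [0, 0, 1], 'child csn': [11, 13, 20],
                        'days': [3, 4, 4]}, index=idx)
with_14 = pd.DataFrame({'patient': [0, 0, 1], 'child csn': [11, 14, 20],
                        'days': [3, 4, 4]}, index=idx)

print(is_acceptable(with_13, df), is_acceptable(with_14, df))
print(is_acceptable(df.groupby('parent csn').min(), df))
```

Either hand-built frame is fine with me, whereas the plain `groupby(...).min()` result is not, because its rows do not exist in the original data.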
Many thanks for your time!