I have the following python pandas dataframe:
df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'], 'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'], 'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})
>>> print(df)
Id A B C
0 1 TRUE NaN NaN
1 1 TRUE NaN NaN
2 1 TRUE abc NaN
3 2 TRUE NaN NaN
4 2 TRUE NaN NaN
5 3 FALSE def 456
I want to end up with the following dataframe:
>>> print(dfout)
Id A B C
0 1 TRUE abc NaN
The same Id value can appear on multiple rows. Each Id will either have the value TRUE or FALSE in column A consistently on all its rows. Columns B and C can have any value, including NaN.
I want one row in dfout for each Id that has A=TRUE and show the max value seen in columns B and C. But if the only values seen in columns B and C = NaN for all of an Id's rows, then that Id is to be excluded from dfout.
- Id 1 has
A=TRUE
, and hasB=abc
in its third row, so it meets the requirements. - Id 2 has
A=TRUE
, but columns B and C areNaN
for both its rows, so it does not. - Id 3 has
A=FALSE
, so it does not meet requirements.
I created a groupby
df on Id, then applied a mask to only include rows with A=TRUE. But having trouble understanding how to remove the rows with NaN
for all rows in columns B and C.
grouped = df.groupby(['Id'])
mask = grouped['A'].transform(lambda x: 'TRUE' == x.max()).astype(bool)
df.loc[mask].reset_index(drop=True)
Id A B C
0 1 TRUE NaN NaN
1 1 TRUE NaN NaN
2 1 TRUE abc NaN
3 2 TRUE NaN NaN
4 2 TRUE NaN NaN
Then I tried several things along the lines of:
df.loc[mask].reset_index(drop=True).all(['B'],['C']).isnull
But getting errors, like:
" TypeError: unhashable type: 'list' ".
Using python 3.6, pandas 0.23.0; Looked here for help: keep dataframe rows meeting a condition into each group of the same dataframe grouped by