
I have a dataframe with several columns, and I selected some of them to create a new variable like this:

xtrain = df[['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]

I want to drop from this selection all rows where the Survive column in the main dataframe is NaN.

cottontail

3 Answers


You can pass a boolean mask to your df based on notnull() of the 'Survive' column and select the columns of interest:

In [2]:
# make some data with the relevant columns
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 7),
                  columns=['Survive', 'Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title'])
# use .loc rather than the chained df['Survive'].iloc[2] assignment,
# which triggers SettingWithCopyWarning and may not modify df at all
df.loc[2, 'Survive'] = np.nan
df
Out[2]:
    Survive       Age      Fare  Group_Size      deck    Pclass     Title
0  1.174206 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  0.036843  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
2       NaN -0.132394 -0.236904   -0.324087  0.570660  0.758084 -0.176421
3 -2.145934 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.197144 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482

Now pass the mask to loc to take only the non-NaN rows:

In [3]:
xtrain = df.loc[df['Survive'].notnull(), ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]
xtrain

Out[3]:
        Age      Fare  Group_Size      deck    Pclass     Title
0 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
3 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482
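
If you also want to keep the 'Survive' column itself in the result (see the comment below), just include it in the column list; a minimal sketch of the same masking approach:

xtrain = df.loc[df['Survive'].notnull(),
                ['Survive', 'Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]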
EdChum
  • Just wish to know why the 'Survive' column is completely missing from the output? The question asks for dropping all rows that have NaNs, not for dropping an entire column that may have one or more NaNs. – MuneshSingh Jun 25 '22 at 17:02

Two alternatives because... well, why not?
Both drop the NaN rows prior to column slicing. That's two calls rather than EdChum's one.

one

df.dropna(subset=['Survive'])[
    ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]

two

df.query('Survive == Survive')[
    ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]
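
The query version works because NaN is the only value that compares unequal to itself, so Survive == Survive is True exactly on the non-NaN rows; a quick check of that assumption:

import numpy as np
np.nan == np.nan  # False -- NaN never compares equal to itself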
piRSquared
  • df.dropna(subset=['Survive'])[['Survive', 'Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']] will retain the 'Survive' column too. – MuneshSingh Jun 25 '22 at 17:48

It might be more readable if you assign the mask and the subset of columns to variables and then filter:

notna_msk = df['Survive'].notna()
cols = ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title', 'Survive']
new_df = df.loc[notna_msk, cols]

Also, in case you already created xtrain from df as in the OP, you can still filter this dataframe with the mask even though it doesn't have the Survive column; the mask is aligned on the index, which is enough.

new_df = xtrain.loc[df['Survive'].notna()]
cottontail