Remove NAs in each column separately and join them - Python

Question

For example, if I have this data:

X1  X2  X3 
a   b   Na 
Na  Na  Na 
b   Na  a 
c   c   Na

The final result would be something like:

    X1  X2  X3 
    a   b   a
    b   c   Na
    c   Na  Na

I tried this function:

df.apply(lambda x: pd.Series(pd.unique(x)))

but I get:

    X1  X2  X3 
    a   b   Na 
    b   c   a 
    c   Na

How can I use this function while ignoring the NAs in pd.unique(x)?

Thanks!

jezrael · Accepted Answer · 2020-01-27T11:29:32.630

I think you need Series.dropna:

df = df.apply(lambda x: pd.Series(x.dropna().to_numpy()))

print (df)
  X1   X2   X3
0  a    b    a
1  b    c  NaN
2  c  NaN  NaN
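
For anyone who wants to reproduce this, here is a self-contained sketch. The frame construction is an assumption (the question does not show how df was built), with the missing cells stored as real np.nan values:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data, with real NaN values
df = pd.DataFrame({
    "X1": ["a", np.nan, "b", "c"],
    "X2": ["b", np.nan, np.nan, "c"],
    "X3": [np.nan, np.nan, "a", np.nan],
})

# Drop the NaNs in each column separately; pandas aligns the shorter
# columns on the fresh 0..n index and pads them with NaN
out = df.apply(lambda x: pd.Series(x.dropna().to_numpy()))
print(out)
```

Wrapping the values in a new pd.Series is what resets the index, so each column starts again at row 0.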

To improve performance, it is possible to use a slightly modified version of the justify function by Divakar:

import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as missing (pass np.nan to treat nulls as missing)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        #change to notnull
        mask = pd.notnull(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    #change dtype to object
    out = np.full(a.shape, invalid_val, dtype=object)  
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

df = pd.DataFrame(justify(df.values, invalid_val=np.nan, side='up', axis=0), 
                  columns=df.columns).dropna(how='all')
print (df)
  X1   X2   X3
0  a    b    a
1  b    c  NaN
2  c  NaN  NaN

Great! Exactly what I was looking for. – Aibloy, Jan 27 '20 at 11:29

score 2 · Answer 2 · answered Jan 27 '20 at 11:30

IIUC here's a NumPy based approach:

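import numpy as np
import pandas as pd

# Sort each column's NaN mask so the non-missing rows come first;
# kind='stable' keeps those rows in their original order
a = np.take_along_axis(df.values, df.isna().values.argsort(0, kind='stable'), 0)
pd.DataFrame(a, columns=df.columns)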

    X1   X2   X3
0    a    b    a
1    b    c  NaN
2    c  NaN  NaN
3  NaN  NaN  NaN
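
Note that this output keeps the original number of rows, so the last row is entirely NaN. If you want the same shape as the other answers, one option is to drop the all-NaN rows afterwards. This is only a sketch: the frame construction is assumed, and the stable sort is requested explicitly so the order of the remaining values does not depend on the default sort algorithm.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data, with real NaN values
df = pd.DataFrame({
    "X1": ["a", np.nan, "b", "c"],
    "X2": ["b", np.nan, np.nan, "c"],
    "X3": [np.nan, np.nan, "a", np.nan],
})

# Stable argsort of the NaN mask: False (present) sorts before True (missing)
order = df.isna().to_numpy().argsort(axis=0, kind="stable")
a = np.take_along_axis(df.to_numpy(), order, axis=0)

# Drop the rows that ended up entirely NaN
out = pd.DataFrame(a, columns=df.columns).dropna(how="all")
print(out)
```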

Double-check that your missing values are actual np.nan values. If they are the string 'Na' instead, you can convert them with:

df.replace('Na', float('nan'), inplace=True)
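
A quick way to check this, as a sketch: the frame below is hypothetical and assumes the missing cells were read in as the literal string 'Na'. In that case isna() finds nothing until the placeholder is replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where missing cells are the string 'Na', not NaN
raw = pd.DataFrame({
    "X1": ["a", "Na", "b", "c"],
    "X3": ["Na", "Na", "a", "Na"],
})
missing_before = int(raw.isna().sum().sum())

# Replace the placeholder with a real NaN (returns a new frame)
clean = raw.replace("Na", np.nan)
missing_after = int(clean.isna().sum().sum())

print(missing_before, missing_after)
```

If the first number is zero and the second is not, the data held string placeholders and the dropna-based answers would not have removed anything.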

score 1 · Answer 3 · answered Jan 27 '20 at 11:27

df.apply(lambda x: x.dropna().reset_index(drop=True))

Or:

df.apply(lambda x: x.dropna().tolist()).apply(pd.Series).T


    X1  X2  X3
0   a   b   a
1   b   c   NaN
2   c   NaN NaN

Allen Qin

score 0 · Answer 4 · answered Jan 27 '20 at 11:27

Another solution, using pd.concat():

print( pd.concat((df[c].dropna().reset_index(drop=True) for c in df.columns), axis=1) )

Prints:

  X1   X2   X3
0  a    b    a
1  b    c  NaN
2  c  NaN  NaN

Andrej Kesely

score 0 · Answer 5 · edited Jan 27 '20 at 14:46

I would just add the function .dropna() to yours:

df.apply(lambda x: pd.Series(pd.unique(x.dropna())))
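
Keep in mind that pd.unique also removes repeated values, just as the function in the question does. A small sketch with a made-up column containing a duplicate shows how this differs from a plain dropna():

```python
import numpy as np
import pandas as pd

# Hypothetical column with a repeated value
s = pd.Series(["a", np.nan, "a", "b"])

kept = s.dropna().tolist()             # duplicates preserved
deduped = list(pd.unique(s.dropna()))  # duplicates removed, first-seen order
print(kept, deduped)
```

If repeated values should survive, the dropna-only variants in the other answers are the safer choice.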

hope this helps

edited Jan 27 '20 at 14:46 · nucsit026

answered Jan 27 '20 at 11:32 · el_Rinaldo
