Remove NaN from each column and rearrange it with Python pandas/NumPy

Question

I have an issue similar to the one in my previous question:

Remove zero from each column and rearranging it with python pandas/numpy

But in this case I need to remove NaN values. I have tried many approaches, including adapting a solution from my previous post:

a = a[a!=np.nan].reshape(-1,3)

but it gave me a weird result. Here is my initial matrix, taken from a DataFrame:

A     B     C     D     E     F
nan   nan   nan   0.0   27.7  nan
nan   nan   nan   5.0   27.5  nan
nan   nan   nan   10.0  27.4  nan
0.0   29.8  nan   nan   nan   nan
5.0   29.9  nan   nan   nan   nan
10.0  30.0  nan   nan   nan   nan
nan   nan   0.0   28.6  nan   nan
nan   nan   5.0   28.6  nan   nan
nan   nan   10.0  28.5  nan   nan
nan   nan   15.0  28.4  nan   nan
nan   nan   20.0  28.3  nan   nan
nan   nan   25.0  28.2  nan   nan

And I expect a result like this:

A NaN in numpy will never equal to NaN. You have `isnan` for this. Adapting the previous answer is straight forward with this change — yatu, Aug 25 '20 at 10:37
yes you're right.. I didn't notice there's isnan to adapt with different problem. My bad to not paying attention with it. Thanks anyway — ShortHair, Aug 25 '20 at 10:46
NaN means Not A Number. If there are two variables that are not a number (say they are "A" & "B" respectively), they may not be necessarily equal to each other. Think about it. — Ken T, Aug 25 '20 at 11:29
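
The point made in the comments can be checked with a small sketch. The array below is made up for illustration and is not the asker's data; it only shows why a `!= np.nan` filter removes nothing while an `np.isnan` mask does.

```python
import numpy as np

# Toy array, invented for this illustration.
a = np.array([[np.nan, 1.0, 2.0],
              [3.0, np.nan, 4.0]])

# NaN never compares equal to anything, including itself,
# so this mask is True everywhere and filters nothing out.
wrong_mask = a != np.nan
print(wrong_mask.all())   # True

# np.isnan tests for NaN explicitly; ~ inverts the mask.
valid_mask = ~np.isnan(a)
print(a[valid_mask])      # [1. 2. 3. 4.]
```

Because the first mask keeps every element, the NaN values survive the filter and the subsequent `reshape` operates on the wrong data.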

Marios · Accepted Answer · 2020-08-25T10:44:23.493

5

Solution:

Given the input dataframe a:

a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')

This gives the desired output as long as every column you want to keep ends up with the same number of non-NaN values.

Example:

import numpy as np
import pandas as pd

a = pd.DataFrame({ 'A':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                   'B':[np.nan,np.nan,np.nan,np.nan,np.nan,4],
                   'C':[7,np.nan,9,np.nan,2,np.nan],
                   'D':[1,3,np.nan,7,np.nan,np.nan],
                   'E':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})

print (a)

         A    B    C    D   E
      0 NaN  NaN  7.0  1.0 NaN
      1 NaN  NaN  NaN  3.0 NaN
      2 NaN  NaN  9.0  NaN NaN
      3 NaN  NaN  NaN  7.0 NaN
      4 NaN  NaN  2.0  NaN NaN
      5 NaN  4.0  NaN  NaN NaN

a_new = a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')

print(a_new)

       C    D
   0  7.0  1.0
   1  9.0  3.0
   2  2.0  7.0

edited Aug 25 '20 at 10:44

answered Aug 25 '20 at 10:36

Marios


Thank you! so it basically automatically fills up NaN from nearest cell right? – ShortHair Aug 25 '20 at 10:52
Yes. For documentation purposes, please accept the answer that helped you the most. – Marios Aug 25 '20 at 10:54
Okay got it ... hmm but the 5th row is removed? – ShortHair Aug 25 '20 at 11:07
Don't use `apply` solutions if performance is important and a vectorized alternative exists, because `apply` loops under the hood. – jezrael Aug 25 '20 at 11:07

jezrael · Answer 2 · 2020-08-25T11:14:13.250

Use np.isnan to test for missing values and ~ to invert the mask. This works if there are always exactly 2 non-missing values per row:

a = df.to_numpy()
df = pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
print (df)
       0     1
0    0.0  27.7
1    5.0  27.5
2   10.0  27.4
3    0.0  29.8
4    5.0  29.9
5   10.0  30.0
6    0.0  28.6
7    5.0  28.6
8   10.0  28.5
9   15.0  28.4
10  20.0  28.3
11  25.0  28.2
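
The reshape above relies on every row holding exactly two non-missing values. A minimal sketch of guarding that assumption is shown below; the frame is a shortened stand-in for the question's data, and the explicit check is my own addition rather than part of the original answer.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Shortened stand-in for the question's frame.
df = pd.DataFrame(
    [[nan, nan, nan, 0.0, 27.7],
     [0.0, 29.8, nan, nan, nan],
     [nan, nan, 0.0, 28.6, nan],
     [nan, nan, 5.0, 28.6, nan]],
    columns=list("ABCDE"),
)

a = df.to_numpy()
mask = ~np.isnan(a)

# Count the valid values in each row and fail early if the assumption breaks,
# instead of silently pairing values that come from different rows.
per_row = mask.sum(axis=1)
if not (per_row == 2).all():
    raise ValueError("expected exactly 2 non-NaN values in every row")

# Boolean indexing flattens in row-major order, so each consecutive pair
# of values comes from the same row.
out = pd.DataFrame(a[mask].reshape(-1, 2))
print(out)
```

Without the check, a row with one or three valid values would shift every later pair, or make the reshape fail only when the total count happens to be odd.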

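Another idea is to use a justify function (a helper that is not part of NumPy or pandas and has to be defined separately) and then remove the columns that contain only NaNs: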

df1 = (pd.DataFrame(justify(a, invalid_val=np.nan),
                    columns=df.columns).dropna(how='all', axis=1))
print (df1)
       A     B
0    0.0  27.7
1    5.0  27.5
2   10.0  27.4
3    0.0  29.8
4    5.0  29.9
5   10.0  30.0
6    0.0  28.6
7    5.0  28.6
8   10.0  28.5
9   15.0  28.4
10  20.0  28.3
11  25.0  28.2
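
The `justify` function is not defined in this answer. As a rough idea of what such a helper does, here is a minimal sketch that left-justifies the non-NaN values of each row. The name `justify_left_sketch` is hypothetical, it handles only 2-D float arrays and only left justification, and it is not claimed to be the function used above.

```python
import numpy as np
import pandas as pd

def justify_left_sketch(a, invalid_val=np.nan):
    """Hypothetical helper: push the non-NaN values of each row to the left."""
    mask = ~np.isnan(a)
    # Sorting a boolean row puts False first; reversing puts True first,
    # which marks the left-hand slots the valid values should move into.
    target = np.sort(mask, axis=1)[:, ::-1]
    out = np.full(a.shape, invalid_val, dtype=float)
    # Both masks select the same number of cells per row, in row-major order,
    # so each row's values keep their original left-to-right order.
    out[target] = a[mask]
    return out

nan = np.nan
# Shortened stand-in for the question's frame.
df = pd.DataFrame(
    [[nan, nan, nan, 0.0, 27.7],
     [0.0, 29.8, nan, nan, nan],
     [nan, nan, 0.0, 28.6, nan],
     [nan, nan, 5.0, 28.6, nan]],
    columns=list("ABCDE"),
)

justified = justify_left_sketch(df.to_numpy())
# Reattach the original column names, then drop columns that are all NaN.
df1 = pd.DataFrame(justified, columns=df.columns).dropna(how='all', axis=1)
print(df1)
```

Unlike the reshape approach, this does not require the same number of valid values in every row; shorter rows simply keep trailing NaN values.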

EDIT:

df = pd.concat([df] * 1000, ignore_index=True)

a = df.to_numpy()
print (a.shape)
(12000, 6)


In [168]: %timeit df.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
8.06 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [172]: %%timeit
     ...: a = df.to_numpy()
     ...: pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
     ...:
422 µs ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [173]: %timeit pd.DataFrame(justify(a, invalid_val=np.nan),columns=df.columns).dropna(how='all', axis=1)
2.88 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Thank you so much! it works perfectly! it gave me more perspective to use pandas/numpy — ShortHair, Aug 25 '20 at 10:48
@diyon - Performance is not important? If yes, dont use `apply` solution. — jezrael, Aug 25 '20 at 11:07
for me as long as it solves my problem I don't mind use it :D .. just need to practice more to solve this kind of problem — ShortHair, Aug 25 '20 at 11:10
@diyon - not understand, so performance is not important? Added some timengs to answer in small dataframe (11k rows) for see `.apply` is bad decision here. — jezrael, Aug 25 '20 at 11:15
@diyon - ya, it is up to you. I suggest dont use `apply`, but if data are small or perfromance is not important nefer mind. Good luck! — jezrael, Aug 25 '20 at 11:18
