1

I have an issue similar to the one in my previous question:

Remove zero from each column and rearranging it with python pandas/numpy

In this case, though, I need to remove NaN values. I have tried several approaches, including adapting the solution from my previous post:

a = a[a!=np.nan].reshape(-1,3)

but it gave me an unexpected result. Here is my initial matrix, taken from a DataFrame:

 A   B   C   D    E   F
nan nan nan 0.0  27.7 nan
nan nan nan 5.0  27.5 nan
nan nan nan 10.0 27.4 nan
0.0  29.8 nan nan nan nan
5.0  29.9 nan nan nan nan
10.0 30.0 nan nan nan nan
nan nan 0.0  28.6 nan nan 
nan nan 5.0  28.6 nan nan 
nan nan 10.0 28.5 nan nan 
nan nan 15.0 28.4 nan nan 
nan nan 20.0 28.3 nan nan 
nan nan 25.0 28.2 nan nan

I expect a result like this:

 A    B
0.0  27.7
5.0  27.5
10.0 27.4
0.0  29.8 
5.0  29.9 
10.0 30.0 
0.0  28.6 
5.0  28.6
10.0 28.5 
15.0 28.4 
20.0 28.3
25.0 28.2
Marios
ShortHair
  • 1
    A NaN in NumPy never compares equal to NaN. Use `isnan` for this. Adapting the previous answer is straightforward with this change – yatu Aug 25 '20 at 10:37
  • Yes, you're right. I didn't notice that isnan could be used to adapt the answer to this problem. My bad for not paying attention to it. Thanks anyway – ShortHair Aug 25 '20 at 10:46
  • NaN means Not A Number. If two variables are not numbers (say they are "A" and "B" respectively), they are not necessarily equal to each other. Think about it. – Ken T Aug 25 '20 at 11:29
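
To illustrate the point made in the comments, here is a minimal sketch on a small made-up array (the values are illustrative, not the asker's full data). It shows why a mask built with `!= np.nan` filters nothing, while `np.isnan` selects the valid entries:

```python
import numpy as np

# Small illustrative array with exactly two valid values per row.
a = np.array([[np.nan, 0.0, 27.7],
              [5.0, 29.9, np.nan]])

# NaN compares unequal to everything, including itself,
# so this mask is True everywhere and removes nothing.
wrong_mask = a != np.nan

# np.isnan flags the missing entries; ~ inverts the mask to keep valid values.
right_mask = ~np.isnan(a)

print(wrong_mask.all())  # True
print(a[right_mask].reshape(-1, 2))
```

Because the first mask keeps every element, the NaNs survive the filtering and the subsequent reshape just rearranges the unfiltered data.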

2 Answers

5

Solution:

Given the input dataframe a:

a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')

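This gives the desired output when all the columns you want to keep hold the same number of non-NaN values: each column is compacted on its own, and any column that still contains a NaN afterwards is dropped.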


Example:

import numpy as np
import pandas as pd

a = pd.DataFrame({ 'A':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                   'B':[np.nan,np.nan,np.nan,np.nan,np.nan,4],
                   'C':[7,np.nan,9,np.nan,2,np.nan],
                   'D':[1,3,np.nan,7,np.nan,np.nan],
                   'E':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})

print (a)

         A    B    C    D   E
      0 NaN  NaN  7.0  1.0 NaN
      1 NaN  NaN  NaN  3.0 NaN
      2 NaN  NaN  9.0  NaN NaN
      3 NaN  NaN  NaN  7.0 NaN
      4 NaN  NaN  2.0  NaN NaN
      5 NaN  4.0  NaN  NaN NaN

a_new = a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')

print(a_new)

       C    D
   0  7.0  1.0
   1  9.0  3.0
   2  2.0  7.0
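
The one-liner above can be split into its two steps to see what each does. This is only a sketch on the same example frame; the intermediate name `compacted` is introduced here for illustration:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    "A": [np.nan] * 6,
    "B": [np.nan, np.nan, np.nan, np.nan, np.nan, 4],
    "C": [7, np.nan, 9, np.nan, 2, np.nan],
    "D": [1, 3, np.nan, 7, np.nan, np.nan],
    "E": [np.nan] * 6,
})

# Step 1: compact each column on its own. Columns with fewer valid
# values than the longest one are padded with NaN at the bottom.
compacted = a.apply(lambda x: pd.Series(x.dropna().values))

# Step 2: drop every column that still contains at least one NaN.
a_new = compacted.dropna(axis="columns")
print(a_new)
```

Column B holds a single valid value, so it is padded with NaN in step 1 and dropped in step 2, together with the all-NaN columns A and E.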
Marios
1

Use np.isnan to test for missing values, with ~ to invert the mask. This works if there are always exactly 2 non-missing values per row:

a = df.to_numpy()
df = pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
print (df)
       0     1
0    0.0  27.7
1    5.0  27.5
2   10.0  27.4
3    0.0  29.8
4    5.0  29.9
5   10.0  30.0
6    0.0  28.6
7    5.0  28.6
8   10.0  28.5
9   15.0  28.4
10  20.0  28.3
11  25.0  28.2
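
For a self-contained version of the mask-and-reshape approach, the sketch below rebuilds a few rows of the question's data and adds an explicit check of the "2 values per row" assumption. The output column names `A` and `B` are chosen here to match the expected result and are otherwise arbitrary:

```python
import numpy as np
import pandas as pd

n = np.nan
# A few rows taken from the question's matrix.
df = pd.DataFrame(
    [[n, n, n, 0.0, 27.7, n],
     [n, n, n, 5.0, 27.5, n],
     [0.0, 29.8, n, n, n, n],
     [n, n, 0.0, 28.6, n, n],
     [n, n, 20.0, 28.3, n, n]],
    columns=list("ABCDEF"),
)

a = df.to_numpy()
mask = ~np.isnan(a)

# The reshape only pairs values correctly if every row has exactly 2 of them.
assert (mask.sum(axis=1) == 2).all()

# Boolean indexing flattens in row-major order, so each row's pair stays adjacent.
out = pd.DataFrame(a[mask].reshape(-1, 2), columns=["A", "B"])
print(out)
```

If some rows held a different number of valid values, the reshape would either fail or silently pair values from different rows, which is why the check comes first.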

Another idea is to use a justify function (a helper defined outside this post, not part of pandas or NumPy) and then drop the columns that contain only NaNs:

df1 = (pd.DataFrame(justify(a, invalid_val=np.nan),
                    columns=df.columns).dropna(how='all', axis=1))
print (df1)
       A     B
0    0.0  27.7
1    5.0  27.5
2   10.0  27.4
3    0.0  29.8
4    5.0  29.9
5   10.0  30.0
6    0.0  28.6
7    5.0  28.6
8   10.0  28.5
9   15.0  28.4
10  20.0  28.3
11  25.0  28.2
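
The `justify` helper is not shown in this answer. As a rough idea of what such a function might look like, here is a hypothetical stand-in that pushes the valid entries of each row to the left; it is an assumption for illustration and may differ from the helper the answer actually relies on:

```python
import numpy as np
import pandas as pd

def justify(a, invalid_val=np.nan):
    """Hypothetical stand-in: move the valid entries of each row to the left."""
    a = np.asarray(a, dtype=float)
    if isinstance(invalid_val, float) and np.isnan(invalid_val):
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting a boolean row puts False first; flipping moves the True cells left.
    justified_mask = np.flip(np.sort(mask, axis=1), axis=1)
    out = np.full(a.shape, invalid_val, dtype=float)
    # Both masks hold the same number of True cells per row, in row-major order.
    out[justified_mask] = a[mask]
    return out

n = np.nan
df = pd.DataFrame(
    [[n, n, n, 0.0, 27.7, n],
     [0.0, 29.8, n, n, n, n],
     [n, n, 20.0, 28.3, n, n]],
    columns=list("ABCDEF"),
)

df1 = (pd.DataFrame(justify(df.to_numpy(), invalid_val=np.nan),
                    columns=df.columns).dropna(how="all", axis=1))
print(df1)
```

Unlike the reshape approach, this does not require the same number of valid values in every row; shorter rows simply keep trailing NaNs.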

EDIT: Timings on a larger DataFrame:

df = pd.concat([df] * 1000, ignore_index=True)

a = df.to_numpy()
print (a.shape)
(12000, 6)


In [168]: %timeit df.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
8.06 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [172]: %%timeit
     ...: a = df.to_numpy()
     ...: pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
     ...:
422 µs ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [173]: %timeit pd.DataFrame(justify(a, invalid_val=np.nan),columns=df.columns).dropna(how='all', axis=1)
2.88 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
jezrael
  • 1
    Thank you so much! It works perfectly! It gave me more perspective on using pandas/numpy – ShortHair Aug 25 '20 at 10:48
  • 1
    @diyon - Is performance important? If so, don't use the `apply` solution. – jezrael Aug 25 '20 at 11:07
  • For me, as long as it solves my problem I don't mind using it :D I just need to practice more on this kind of problem – ShortHair Aug 25 '20 at 11:10
  • @diyon - I don't understand, so performance is not important? I added some timings to the answer on a small DataFrame (12k rows) to show that `.apply` is a bad choice here. – jezrael Aug 25 '20 at 11:15
  • @diyon - OK, it is up to you. I suggest not using `apply`, but if the data are small or performance is not important, never mind. Good luck! – jezrael Aug 25 '20 at 11:18