How to compress a dataframe by removing columns that contain 'NaN' values in between columns that have a value?

Question

I am currently following the answer here. It mostly worked, but when I viewed the whole dataframe, I saw that there are columns that contain 'NaN' values in between columns that do contain a value.

For example, I keep getting a result like this:

     ID | 0  | 1  |   2  |  3   | 4   | 5  | 6  |  7   |  8   | 9
300 1001|1001|1002|  NaN | NaN  | NaN |1001|1002|  NaN | NaN  | NaN   
301 1010|1010|NaN |  NaN | 1000 | 2000|1234| NaN|  NaN | 1213 | 1415
302 1100|1234|5678| 9101 | 1121 | 3141|2345|6789| 1011 | 1617 | 1819
303 1000|2001|9876|  NaN | NaN  | NaN |1001|1002|  NaN | NaN  | NaN

Is there a way to remove the cells that contain NaN so that the output looks like this:

     ID | 0  | 1  |   2  |  3   | 4   | 5  | 6  |  7   |  8   | 9
300 1001|1001|1002|  1001| 1002 | NaN |NaN | NaN|  NaN | NaN  | NaN   
301 1010|1010|1000|  2000| 1234 | 1213|1415| NaN|  NaN | NaN  | NaN
302 1100|1234|5678|  9101| 1121 | 3141|2345|6789| 1011 | 1617 | 1819
303 1000|2001|9876|  1001| 1002 | NaN |NaN |NaN |  NaN | NaN  | NaN

score 3 · Accepted Answer · answered Jun 07 '19 at 04:52

Using pd.DataFrame.iterrows with pd.concat:

import pandas as pd

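# Drop the NaNs from each row, renumber what is left from 0, then stack the rows back together.
# axis must be passed by keyword: recent pandas versions no longer accept it positionally.
df[df.columns] = pd.concat([s.dropna().reset_index(drop=True) for _, s in df.iterrows()], axis=1).T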

Output:

         ID     0     1     2     3     4     5     6     7     8     9
0  300 1001  1001  1002  1001  1002   NaN   NaN   NaN   NaN   NaN   NaN
1  301 1010  1010  1000  2000  1234  1213  1415   NaN   NaN   NaN   NaN
2  302 1100  1234  5678  9101  1121  3141  2345  6789  1011  1617  1819
3  303 1000  2001  9876  1001  1002   NaN   NaN   NaN   NaN   NaN   NaN
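For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's frame and applies the same idea. The names `rows` and `compressed` are illustrative, and the `reindex` step is an added safeguard (not part of the answer above) for the case where no row is completely filled, which would otherwise leave the result with fewer columns than the original.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [
        [1001, 1001, 1002, nan, nan, nan, 1001, 1002, nan, nan, nan],
        [1010, 1010, nan, nan, 1000, 2000, 1234, nan, nan, 1213, 1415],
        [1100, 1234, 5678, 9101, 1121, 3141, 2345, 6789, 1011, 1617, 1819],
        [1000, 2001, 9876, nan, nan, nan, 1001, 1002, nan, nan, nan],
    ],
    index=[300, 301, 302, 303],
    columns=["ID"] + list(range(10)),
)

# One Series per row with the NaNs dropped and positions renumbered from 0.
rows = [s.dropna().reset_index(drop=True) for _, s in df.iterrows()]

# Stack the rows back together; shorter rows are padded with NaN on the right.
compressed = pd.concat(rows, axis=1).T

# Safeguard (an assumption, not in the answer): pad to the original width
# in case every row lost at least one value.
compressed = compressed.reindex(columns=range(df.shape[1]))
compressed.columns = df.columns

print(compressed)
```

Note that this variant builds a new frame instead of assigning back into `df`, so the original data stays available for comparison.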

ahh i didnt think of truncating it like that ... nice – Joran Beasley, Jun 07 '19 at 04:59

score 1 · Answer 2 · answered Jun 07 '19 at 04:58

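Just sort each row with np.isnan as the key. Python's sort is stable, so the non-NaN values keep their original order and the NaNs move to the end of the row.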

import pandas as pd
import numpy as np
raw = [ [1,2,np.nan,3,np.nan],
        [1,np.nan,3,2,7]]
original = pd.DataFrame(raw)
# np.isnan returns False for real values and True for NaN, so NaNs sort last in each row
s = original.apply(lambda x: pd.Series(sorted(x, key=np.isnan)), axis=1)
print(s)
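The same "sort by is-NaN" idea can also be written without a Python-level loop over rows. The following is a sketch of that variant, not part of the answer above: it assumes all columns are numeric, and the names `values`, `order` and `shifted` are illustrative.

```python
import numpy as np
import pandas as pd

raw = [[1, 2, np.nan, 3, np.nan],
       [1, np.nan, 3, 2, 7]]
original = pd.DataFrame(raw)

values = original.to_numpy(dtype=float)

# False (a real value) sorts before True (NaN). kind="stable" keeps ties in
# their original order, so the real values are not reshuffled.
order = np.argsort(np.isnan(values), axis=1, kind="stable")

# Reorder every row by its own permutation.
shifted = pd.DataFrame(
    np.take_along_axis(values, order, axis=1),
    index=original.index,
    columns=original.columns,
)
print(shifted)
```

The stable sort is what matters here: with an unstable algorithm the NaNs would still end up last, but the order of the remaining values would not be guaranteed.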

Joran Beasley


Answer 3 · jezrael · answered Jun 07 '19 at 05:14

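Use justify if performance is important. Note that justify is not a pandas or NumPy function; it is a NumPy-based helper that has to be defined separately before running this code: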

df = pd.DataFrame(justify(df.to_numpy(), invalid_val=np.nan), 
                  index=df.index, 
                  columns=df.columns)
print (df)
         ID       0       1       2       3       4       5       6       7  \
300  1001.0  1001.0  1002.0  1001.0  1002.0     NaN     NaN     NaN     NaN   
301  1010.0  1010.0  1000.0  2000.0  1234.0  1213.0  1415.0     NaN     NaN   
302  1100.0  1234.0  5678.0  9101.0  1121.0  3141.0  2345.0  6789.0  1011.0   
303  1000.0  2001.0  9876.0  1001.0  1002.0     NaN     NaN     NaN     NaN   

          8       9  
300     NaN     NaN  
301     NaN     NaN  
302  1617.0  1819.0  
303     NaN     NaN
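The `justify` helper itself is not shown in this answer. As a rough guide to what such a function could look like, here is a minimal sketch that only left-justifies rows of a 2-D float array. It is an assumption about the helper's behaviour inferred from the call and the output above, not the original implementation, and it keeps only the `invalid_val` parameter that the call uses.

```python
import numpy as np

def justify(a, invalid_val=np.nan):
    """Push the valid entries of each row to the left (hypothetical sketch)."""
    a = np.asarray(a, dtype=float)

    # True where a cell holds a real value.
    if np.isnan(invalid_val):
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val

    # Sorting a boolean row puts False first; flipping moves the True cells
    # to the left while keeping the per-row count of True unchanged.
    justified_mask = np.flip(np.sort(mask, axis=1), axis=1)

    # Fill with the invalid value, then copy the valid values over in
    # row-major order so each row keeps its own values in their original order.
    out = np.full(a.shape, invalid_val, dtype=float)
    out[justified_mask] = a[mask]
    return out

demo = justify(np.array([[1.0, np.nan, 2.0],
                         [np.nan, np.nan, 3.0]]))
print(demo)
```

Because everything happens in whole-array NumPy operations, there is no per-row Python loop, which is why this style tends to be faster on large frames than `iterrows` or `apply`.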

If the first column is non-numeric or may itself contain missing values, apply the solution to all columns except the first, then add the first column back with insert:

df.columns = df.columns[:1].tolist() + df.columns[1:].astype(int).tolist()

arr = justify(df.to_numpy()[:, 1:], invalid_val=np.nan)
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns[1:] + 1)
df1.insert(0,'ID', df['ID'])
print (df1)
       ID       1       2       3       4       5       6       7       8  \
300  1001  1001.0  1002.0  1001.0  1002.0     NaN     NaN     NaN     NaN   
301  1010  1010.0  1000.0  2000.0  1234.0  1213.0  1415.0     NaN     NaN   
302  1100  1234.0  5678.0  9101.0  1121.0  3141.0  2345.0  6789.0  1011.0   
303  1000  2001.0  9876.0  1001.0  1002.0     NaN     NaN     NaN     NaN   

          9      10  
300     NaN     NaN  
301     NaN     NaN  
302  1617.0  1819.0  
303     NaN     NaN
