Keep the first n non NaN cells in each row of a pandas DataFrame

Question

I have a pandas DataFrame with at least 4 non-NaN values in each row, but located in different columns:

Index       Col1     Col2      Col3         Col4     Col5  Col6  Col7  Col8 
1991-12-31  100.000 100.000    100.000     89.123   NaN    NaN   NaN   NaN                     
1992-01-31  98.300  101.530    100.000     NaN      92.342 NaN   NaN   NaN                     
1992-02-29  NaN     100.230    98.713      97.602   NaN    NaN   NaN   NaN                     
1992-03-31  NaN     NaN        102.060     93.473   98.123 NaN   NaN   NaN                     
1992-04-30  NaN     102.205    107.755     94.529   94.529 NaN   NaN   NaN

(Only the first 8 columns are shown.) I would like to turn this into a DataFrame with 4 columns. Each row should contain only the first four non-NaN values (reading from left to right) for that date.

Edit:

The order on each row matters.

Does order within each row matter? If not, it may be possible to offer a highly performant solution. — cs95, Nov 27 '17 at 11:47
It *does* matter (the rest of this comment is to hit the character limit) — user189035, Nov 27 '17 at 12:12

Divakar · Answer 1 · 2017-11-27T12:33:04.310

Approach #1 : Here's a NumPy solution using justify -

pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
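The `justify` helper is not defined anywhere in this thread. As a rough sketch only, assuming a 2D array and the same call signature as used above, such a helper might look like the following; it is an illustration, not necessarily the exact function used for the timings below.

```python
import numpy as np
import pandas as pd

def justify(a, invalid_val=np.nan, axis=1, side='left'):
    """Push valid entries of a 2D array to one side, keeping their order."""
    # True where the entry is valid (NaN needs isnan, because NaN != NaN)
    if isinstance(invalid_val, float) and np.isnan(invalid_val):
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting booleans puts False first; flip so the True (valid) slots come first
    justified_mask = np.sort(mask, axis=axis)
    if side in ('left', 'up'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=a.dtype)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

# Small illustrative input (not the data from the question)
demo = np.array([[np.nan, 1.0, np.nan, 2.0, 3.0],
                 [4.0, np.nan, 5.0, np.nan, np.nan]])
demo_left = pd.DataFrame(justify(demo, invalid_val=np.nan, axis=1, side='left')[:, :3])
```

Rows with fewer valid values than the requested width are padded with NaN on the right.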

Sample run -

In [211]: df
Out[211]: 
             Col1     Col2     Col3    Col4    Col5  Col6  Col7  Col8
Index                                                                
1991-12-31  100.0  100.000  100.000  89.123     NaN   NaN   NaN   NaN
1992-01-31   98.3  101.530  100.000     NaN  92.342   NaN   NaN   NaN
1992-02-29    NaN  100.230   98.713  97.602     NaN   NaN   NaN   NaN
1992-03-31    NaN      NaN  102.060  93.473  98.123   NaN   NaN   NaN
1992-04-30    NaN  102.205  107.755  94.529  94.529   NaN   NaN   NaN

In [212]: pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
Out[212]: 
         0        1        2       3
0  100.000  100.000  100.000  89.123
1   98.300  101.530  100.000  92.342
2  100.230   98.713   97.602     NaN
3  102.060   93.473   98.123     NaN
4  102.205  107.755   94.529  94.529

Approach #2 : Using a tailor-made function based on masks -

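def app2(df, N=4):
    a = df.values
    # Start from all-NaN so rows with fewer than N valid values are padded with NaN
    # (np.empty_like would leave uninitialized values in those slots)
    out = np.full(a.shape, np.nan)
    mask = df.isnull().values
    # Sorting booleans puts False (valid) first, i.e. left-justifies the valid slots
    mask_sorted = np.sort(mask,1)
    out[~mask_sorted] = a[~mask]
    return pd.DataFrame(out[:,:N])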

Runtime test for working solutions that keep order -

# Using df from posted question to recreate a bigger one :
df = df.set_index('Index')
df = pd.concat([df] * 10000, ignore_index=True)

In [298]: %timeit app2(df)
100 loops, best of 3: 4.06 ms per loop

In [299]: %timeit pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
100 loops, best of 3: 4.78 ms per loop

In [300]: %timeit df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]
1 loop, best of 3: 4.05 s per loop

@Divakar your solutions here https://stackoverflow.com/questions/46326140/how-to-sort-a-numpy-array-with-key-as-isnan would also be faster. — Bharath M Shetty, Nov 27 '17 at 12:04
I get `TypeError: ufunc 'isnan' not supported for the input types` at the `np.isnan(a)` line. Any idea? — cs95, Nov 27 '17 at 12:14
@colspeed I think you can try `pd.isnull` instead (just a drop-in replacement, it should work) — user189035, Nov 27 '17 at 12:14
Ah, nevermind. It was my mistake not having set the index. Btw this appears a couple ms slower than your previous solution. — cs95, Nov 27 '17 at 12:17
@Divakar: thanks for your answer. I ended up using COLDSPEED's but all answers are super cool, it's really a subjective call. — user189035, Nov 27 '17 at 12:26

cs95 · Accepted Answer · 2017-11-27T12:15:51.803

If order isn't important, you can call np.sort along axis 1 (i.e. within each row); NaN values are sorted to the end.

df = df.set_index('Index')   # ignore if `Index` already is the index

pd.DataFrame(np.sort(df.values, axis=1)[:, :4], 
           columns=np.arange(1, 5)).add_prefix('Col')

     Col1     Col2     Col3     Col4
0  89.123  100.000  100.000  100.000
1  92.342   98.300  100.000  101.530
2  97.602   98.713  100.230      NaN
3  93.473   98.123  102.060      NaN
4  94.529   94.529  102.205  107.755

This is much faster than my second solution, so if this is possible, definitely consider this.
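A small sketch of why the np.sort variant only fits when order is irrelevant: NumPy sorts NaN after all other values, so the valid entries end up on the left but in ascending order, not in their original column order. The row below is taken from the question's 1992-02-29 data.

```python
import numpy as np

# One row with NaN gaps, in original left-to-right order
row = np.array([[np.nan, 100.230, 98.713, 97.602, np.nan, np.nan]])

# np.sort places NaN last, so the first columns hold the valid values,
# but sorted ascending rather than in their original order
sorted_row = np.sort(row, axis=1)
first_four = sorted_row[:, :4]
```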

If order matters, call sorted + apply and take the first 4 columns of your result.

df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]

               Col1     Col2     Col3    Col4
Index                                        
1991-12-31  100.000  100.000  100.000  89.123
1992-01-31   98.300  101.530  100.000  92.342
1992-02-29  100.230   98.713   97.602     NaN
1992-03-31  102.060   93.473   98.123     NaN
1992-04-30  102.205  107.755   94.529  94.529
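The same "valid values first, original order kept" idea can also be sketched without a row-wise `apply`, by doing a stable argsort on the NaN mask. This is an alternative illustration, not the method from the answer; the function name `first_n_valid` and the `Col1..ColN` output labels are assumptions made for the example. It may also be worth noting that in newer pandas versions, `apply` with a function returning a list may yield a Series of lists unless `result_type` is set.

```python
import numpy as np
import pandas as pd

def first_n_valid(df, n=4):
    """Keep the first n non-NaN values of each row, in original order."""
    a = df.to_numpy(dtype=float)
    # Stable argsort on the boolean NaN mask: False (valid) sorts before
    # True (NaN), and ties keep their original column order
    order = np.argsort(np.isnan(a), axis=1, kind='stable')
    left = np.take_along_axis(a, order, axis=1)[:, :n]
    cols = [f'Col{i}' for i in range(1, left.shape[1] + 1)]
    return pd.DataFrame(left, index=df.index, columns=cols)

# First three rows of the question's data
sample = pd.DataFrame(
    [[100.0, 100.0, 100.0, 89.123, np.nan, np.nan],
     [98.3, 101.53, 100.0, np.nan, 92.342, np.nan],
     [np.nan, 100.23, 98.713, 97.602, np.nan, np.nan]],
    index=['1991-12-31', '1992-01-31', '1992-02-29'],
)
result = first_n_valid(sample, n=4)
```

Like the `sorted` approach, rows with fewer than n valid values are padded with NaN on the right.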

Timings
Here are timings for just my answers -

df = pd.concat([df] * 10000, ignore_index=True)

%timeit df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]
1 loop, best of 3: 8.45 s per loop

%%timeit
pd.DataFrame(np.sort(df.values, axis=1)[:, :4], 
           columns=np.arange(1, 5)).add_prefix('Col')
100 loops, best of 3: 4.76 ms per loop

Also timings from two more functions from @Divakar's answer — Bharath M Shetty, Nov 27 '17 at 12:09
@Bharath Feel free to edit Divakar's or my answer (bit busy atm) — cs95, Nov 27 '17 at 12:12

jezrael · Answer 3 · 2017-11-27T12:22:28.927

You can use:

#if necessary
#df = df.set_index('Index')

df = df.apply(lambda x: pd.Series(x.dropna().values), axis=1).iloc[:, :4]
print (df)
                  0        1        2       3
Index                                        
1991-12-31  100.000  100.000  100.000  89.123
1992-01-31   98.300  101.530  100.000  92.342
1992-02-29  100.230   98.713   97.602     NaN
1992-03-31  102.060   93.473   98.123     NaN
1992-04-30  102.205  107.755   94.529  94.529

Or, for better performance, use NumPy. This relies on the requirement that there are at least 4 non-NaN values per row, and the reshape only works when every row has the same number of non-NaN values:

a = df.values
df = pd.DataFrame(a[~np.isnan(a)].reshape(a.shape[0],-1)[:, :4], index=df.index)
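As a hedged sketch of the limitation of the reshape approach: flattening the valid values and reshaping only lines up when every row contributes the same number of them. If the counts differ, the reshape either raises or, when the total happens to divide evenly, silently shifts values across rows. The guard below makes that assumption explicit; the function name `first_n_by_reshape` is hypothetical.

```python
import numpy as np
import pandas as pd

def first_n_by_reshape(df, n=4):
    """Reshape-based variant; valid only if all rows have equal non-NaN counts."""
    a = df.to_numpy(dtype=float)
    valid = ~np.isnan(a)
    counts = valid.sum(axis=1)
    # Refuse inputs where rows differ, instead of failing or misaligning later
    if counts.min() != counts.max():
        raise ValueError("rows have differing numbers of non-NaN values")
    return pd.DataFrame(a[valid].reshape(a.shape[0], -1)[:, :n], index=df.index)

# Every row has exactly 2 valid values, so the reshape is safe
ok = pd.DataFrame([[1.0, np.nan, 2.0], [np.nan, 3.0, 4.0]])
ok_result = first_n_by_reshape(ok, n=2)

# Rows with 3 and 1 valid values: the total (4) divides evenly by 2 rows,
# so an unguarded reshape would not raise but would mix values between rows
bad = pd.DataFrame([[1.0, 2.0, 3.0], [np.nan, np.nan, 4.0]])
```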

Timings:

        Index   Col1     Col2     Col3    Col4    Col5  Col6  Col7  Col8
0  1991-12-31  100.0  100.000  100.000  89.123     NaN   NaN   NaN   NaN
1  1992-01-31   98.3  101.530  100.000     NaN  92.342   NaN   NaN   NaN
2  1992-02-29    NaN  100.230   98.713  97.602     NaN   NaN   NaN   1.0
3  1992-03-31    NaN      NaN  102.060  93.473  98.123   NaN   NaN   1.0
4  1992-04-30    NaN  102.205  107.755  94.529  94.529   NaN   NaN   NaN

df = df.set_index('Index')

df = pd.concat([df] * 10000, ignore_index=True)

In [260]: %timeit pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
100 loops, best of 3: 6.78 ms per loop

In [261]: %%timeit a = df.values
     ...: pd.DataFrame(a[~np.isnan(a)].reshape(a.shape[0],-1)[:, :4], index=df.index)
     ...: 
100 loops, best of 3: 2.11 ms per loop

In [262]: %timeit pd.DataFrame(np.sort(df.values, axis=1)[:, :4], columns=np.arange(1, 5)).add_prefix('Col')
100 loops, best of 3: 5.28 ms per loop

In [263]: %timeit pd.DataFrame(mask_app(df.values)[:,:4])
100 loops, best of 3: 8.68 ms per loop

edited Nov 27 '17 at 12:22

answered Nov 27 '17 at 11:46

I assume OP would want to keep only 4 columns even if more than 4 columns are not null. – John Zwinck Nov 27 '17 at 11:49
This gives `ValueError: cannot reshape array of size 18 into shape (5,newaxis)`... can you check it again please? – cs95 Nov 27 '17 at 12:02
@cᴏʟᴅsᴘᴇᴇᴅ - I added some other values, because OP guaranteed at least 4 non-NaN values. – jezrael Nov 27 '17 at 12:03
I don't know if you can assume that, since I didn't see OP mention it anywhere... am I mistaken? – cs95 Nov 27 '17 at 12:05
Yes, check first sentence in question. – jezrael Nov 27 '17 at 12:05
Can you provide your setup as well for these timings? Obviously your setup is different, since it just errors out with mine. – cs95 Nov 27 '17 at 12:19
It is really easy, I only added a 1 to the end of the 3rd and 4th rows. – jezrael Nov 27 '17 at 12:20
Added input DataFrame used for timings. – jezrael Nov 27 '17 at 12:23
I don't think your answer is correct. For example, if a row has more than 4 values, this will be incorrect. – cs95 Nov 27 '17 at 12:25
@jezrael: All answers are super nice. I opted for COLDSPEED's but it's really subjective. – user189035 Nov 27 '17 at 12:25
Your example works _only_ because, for your input, all rows have exactly 4 non null values. – cs95 Nov 27 '17 at 12:25
@COLDSPEED: I agree, your code seems a bit safer in that it can handle non conforming input. – user189035 Nov 27 '17 at 12:27
@cᴏʟᴅsᴘᴇᴇᴅ - so sorry, maybe it is necessary to add a better test. You win this question, so congrats... – jezrael Nov 27 '17 at 12:29
