Condensing pandas dataframe by dropping missing elements

Question

Problem

I have a dataframe that looks like this:

Key Var    ID_1 Var_1   ID_2 Var_2  ID_3 Var_3
1   True   1.0  True    NaN  NaN    5.0  True
2   True   NaN  NaN     4.0  False  7.0  True
3   False  2.0  False   5.0  True   NaN  NaN

Each row has exactly 2 non-null sets of data (ID/Var), and the remaining third is guaranteed to be null. What I want to do is "condense" the dataframe by removing the missing elements.

Desired Output

Key Var     First_ID    First_Var   Second_ID   Second_Var
1   True    1           True        5           True
2   True    4           False       7           True
3   False   2           False       5           True

The ordering is not important, so long as the Id/Var pairs are maintained.

Current Solution

Below is a working solution that I have:

import pandas as pd
import numpy as np

data = pd.DataFrame({'Key': [1, 2, 3], 'Var': [True, True, False], 'ID_1':[1, np.NaN, 2],
                    'Var_1': [True, np.NaN, False], 'ID_2': [np.NaN, 4, 5], 'Var_2': [np.NaN, False, True],
                    'ID_3': [5, 7, np.NaN], 'Var_3': [True, True, np.NaN]})

sorted_columns = ['Key', 'Var', 'ID_1', 'Var_1', 'ID_2', 'Var_2', 'ID_3', 'Var_3']
data = data[sorted_columns]

output = np.empty(shape=[data.shape[0], 6], dtype=str)

for i, *row in data.itertuples():
    output[i] = [element for element in row if np.isfinite(element)]

print(output)


[['1' 'T' '1' 'T' '5' 'T']
 ['2' 'T' '4' 'F' '7' 'T']
 ['3' 'F' '2' 'F' '5' 'T']]

This is acceptable, but not ideal. I can live with not having the column names, but my big issue is having to cast the data inside the array into a string in order to avoid my booleans being converted to numeric.

Are there other solutions that do a better job at preserving the data? Bonus points if the result is a pandas dataframe.

Bharath M Shetty · Accepted Answer · 2017-09-20T16:22:44.867

3

There is one simple solution i.e push the nans to right and drop the nans on axis 1. i.e

ndf = data.apply(lambda x : sorted(x,key=pd.isnull),1).dropna(1)

Output:

  Key    Var ID_1  Var_1 ID_2 Var_2
0   1   True    1   True    5  True
1   2   True    4  False    7  True
2   3  False    2  False    5  True

Hope it helps.

A numpy solution from Divakar here for 10x speed i.e

def mask_app(a):
    out = np.full(a.shape,np.nan,dtype=a.dtype)
    mask = ~np.isnan(a.astype(float))
    out[np.sort(mask,1)[:,::-1]] = a[mask]
    return out

ndf = pd.DataFrame(mask_app(data.values),columns=data.columns).dropna(1)

  Key    Var ID_1  Var_1 ID_2 Var_2
0   1   True    1   True    5  True
1   2   True    4  False    7  True
2   3  False    2  False    5  True

edited Sep 20 '17 at 16:22

answered Sep 19 '17 at 16:09

Bharath M Shetty

30,075
6
57
108

1

Thats pretty smart +1 – 2Obe Sep 19 '17 at 16:22
This works -- it isn't as fast as the numpy solution but it does preserve the data's original format. – ZachTurn Sep 20 '17 at 15:09
It would have been nice if you add numpy tag to the question so numpy experts might come up with faster answer . – Bharath M Shetty Sep 20 '17 at 15:11
@ZachTurn well there you go with a numpy solution with data preserved. – Bharath M Shetty Sep 20 '17 at 16:09

Condensing pandas dataframe by dropping missing elements

1 Answers1

Linked