Problem
I have a dataframe that looks like this:
Key Var ID_1 Var_1 ID_2 Var_2 ID_3 Var_3
1 True 1.0 True NaN NaN 5.0 True
2 True NaN NaN 4.0 False 7.0 True
3 False 2.0 False 5.0 True NaN NaN
Each row has exactly 2 non-null sets of data (ID/Var), and the remaining third is guaranteed to be null. What I want to do is "condense" the dataframe by removing the missing elements.
Desired Output
Key Var First_ID First_Var Second_ID Second_Var
1 True 1 True 5 True
2 True 4 False 7 True
3 False 2 False 5 True
The ordering is not important, so long as the Id/Var pairs are maintained.
Current Solution
Below is a working solution that I have:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Key': [1, 2, 3], 'Var': [True, True, False], 'ID_1':[1, np.NaN, 2],
'Var_1': [True, np.NaN, False], 'ID_2': [np.NaN, 4, 5], 'Var_2': [np.NaN, False, True],
'ID_3': [5, 7, np.NaN], 'Var_3': [True, True, np.NaN]})
sorted_columns = ['Key', 'Var', 'ID_1', 'Var_1', 'ID_2', 'Var_2', 'ID_3', 'Var_3']
data = data[sorted_columns]
output = np.empty(shape=[data.shape[0], 6], dtype=str)
for i, *row in data.itertuples():
output[i] = [element for element in row if np.isfinite(element)]
print(output)
[['1' 'T' '1' 'T' '5' 'T']
['2' 'T' '4' 'F' '7' 'T']
['3' 'F' '2' 'F' '5' 'T']]
This is acceptable, but not ideal. I can live with not having the column names, but my big issue is having to cast the data inside the array into a string in order to avoid my booleans being converted to numeric.
Are there other solutions that do a better job at preserving the data? Bonus points if the result is a pandas dataframe.