Given a df as shown below, and assume the value under column lapse
is unique and range from 0 to 18. However, some of the values is not available within this range. For this example, the value 0
,16
and 18
is missing.
lapse (a, i) (a, j) (b, k) c
0 2.0 0.423655 0.645894 0.437587 0.891773
1 4.0 0.963663 0.383442 0.791725 0.528895
2 6.0 0.568045 0.925597 0.071036 0.087129
3 8.0 0.020218 0.832620 0.778157 0.870012
4 10.0 0.978618 0.799159 0.461479 0.780529
5 12.0 0.118274 0.639921 0.143353 0.944669
6 14.0 0.521848 0.414662 0.264556 0.774234
The objective is to create a surrogate rows of these is missing value, and append it to the original df
. Such that, the output should be as below
lapse (a, i) (a, j) (b, k) c
0 0.0 NaN NaN NaN NaN
0 2.0 0.423655 0.645894 0.437587 0.891773
1 4.0 0.963663 0.383442 0.791725 0.528895
2 6.0 0.568045 0.925597 0.071036 0.087129
3 8.0 0.020218 0.832620 0.778157 0.870012
4 10.0 0.978618 0.799159 0.461479 0.780529
5 12.0 0.118274 0.639921 0.143353 0.944669
6 14.0 0.521848 0.414662 0.264556 0.774234
1 16.0 NaN NaN NaN NaN
2 18.0 NaN NaN NaN NaN
The following code snippet is able to answer the above objective. However, in actual implementation, the data frame is of bigger magnitude and I am wonder whether there is better way , or pandas built-in of doing this?
Line to generate to original df
import numpy as np
import pandas as pd
nshape=5
increment=2
max_val=20
np.random.seed(0)
aran=np.arange(0,max_val,increment).astype(int)
nshape=aran.shape[0]
arr=np.concatenate((aran.reshape(-1,1), np.random.random((nshape,4))), axis=1)
# Extracted only selected, other non selected index are assume case to solve
idx_available=[3, 5, 4, 2, 1, 7, 6]
df=pd.DataFrame(arr[sorted(idx_available),:],columns=['lapse',('a','i'),('a','j'),('b','k'),'c'])
Proposed solution
name_other=[i for i in df.columns.tolist() if i!='lapse']
lapse_available=df['lapse'].to_numpy()
lapse_not_available = np.setdiff1d(aran,lapse_available)
an_array = np.empty((len(lapse_not_available),len(name_other)))
an_array[:] = np.NaN
arr2=np.concatenate((lapse_not_available.reshape(-1,1), an_array), axis=1)
df2=pd.DataFrame(arr2,columns=['lapse']+name_other)
df=pd.concat([df,df2],axis=0).sort_values(by=['lapse'])