Create surrogate rows in Pandas based on missing condition

Question

Given a df as shown below, and assume the value under column lapse is unique and range from 0 to 18. However, some of the values is not available within this range. For this example, the value 0,16 and 18 is missing.

   lapse    (a, i)    (a, j)    (b, k)         c
0    2.0  0.423655  0.645894  0.437587  0.891773
1    4.0  0.963663  0.383442  0.791725  0.528895
2    6.0  0.568045  0.925597  0.071036  0.087129
3    8.0  0.020218  0.832620  0.778157  0.870012
4   10.0  0.978618  0.799159  0.461479  0.780529
5   12.0  0.118274  0.639921  0.143353  0.944669
6   14.0  0.521848  0.414662  0.264556  0.774234

The objective is to create a surrogate rows of these is missing value, and append it to the original df. Such that, the output should be as below

   lapse    (a, i)    (a, j)    (b, k)         c
0    0.0       NaN       NaN       NaN       NaN
0    2.0  0.423655  0.645894  0.437587  0.891773
1    4.0  0.963663  0.383442  0.791725  0.528895
2    6.0  0.568045  0.925597  0.071036  0.087129
3    8.0  0.020218  0.832620  0.778157  0.870012
4   10.0  0.978618  0.799159  0.461479  0.780529
5   12.0  0.118274  0.639921  0.143353  0.944669
6   14.0  0.521848  0.414662  0.264556  0.774234
1   16.0       NaN       NaN       NaN       NaN
2   18.0       NaN       NaN       NaN       NaN

The following code snippet is able to answer the above objective. However, in actual implementation, the data frame is of bigger magnitude and I am wonder whether there is better way , or pandas built-in of doing this?

Line to generate to original df

import numpy as np
import pandas as pd

nshape=5
increment=2
max_val=20
np.random.seed(0)
aran=np.arange(0,max_val,increment).astype(int)
nshape=aran.shape[0]
arr=np.concatenate((aran.reshape(-1,1), np.random.random((nshape,4))), axis=1)

# Extracted only selected, other non selected index are assume case to solve
idx_available=[3, 5, 4, 2, 1, 7, 6]

df=pd.DataFrame(arr[sorted(idx_available),:],columns=['lapse',('a','i'),('a','j'),('b','k'),'c'])

Proposed solution

name_other=[i for i in df.columns.tolist() if i!='lapse']
lapse_available=df['lapse'].to_numpy()
lapse_not_available = np.setdiff1d(aran,lapse_available)

an_array = np.empty((len(lapse_not_available),len(name_other)))
an_array[:] = np.NaN
arr2=np.concatenate((lapse_not_available.reshape(-1,1), an_array), axis=1)
df2=pd.DataFrame(arr2,columns=['lapse']+name_other)
df=pd.concat([df,df2],axis=0).sort_values(by=['lapse'])

score 3 · Accepted Answer · answered Dec 16 '21 at 12:01

You can also use:

df.set_index('lapse', inplace=True)
df = df.reindex(np.arange(0,20,2)).reset_index()

OUTPUT

   lapse    (a, i)    (a, j)    (b, k)         c
0      0       NaN       NaN       NaN       NaN
1      2  0.423655  0.645894  0.437587  0.891773
2      4  0.963663  0.383442  0.791725  0.528895
3      6  0.568045  0.925597  0.071036  0.087129
4      8  0.020218  0.832620  0.778157  0.870012
5     10  0.978618  0.799159  0.461479  0.780529
6     12  0.118274  0.639921  0.143353  0.944669
7     14  0.521848  0.414662  0.264556  0.774234
8     16       NaN       NaN       NaN       NaN
9     18       NaN       NaN       NaN       NaN

I am accepting this answer since I can integrate seamlessly if the original df is `df=pd.DataFrame(arr[sorted(idx_available),:],columns=[('lapse',''),('a','i'),('a','j'),('b','k'),('c','')])`, which subsequently modify the solution as `df.set_index(('lapse',''), inplace=True)` — mpx, Dec 16 '21 at 12:34

score 2 · Answer 2 · answered Dec 16 '21 at 11:45

You can merge:

df.merge(pd.DataFrame({'lapse': np.arange(0,20,2)}), how='right')

Output:

   lapse    (a, i)    (a, j)    (b, k)         c
0    0.0       NaN       NaN       NaN       NaN
1    2.0  0.423655  0.645894  0.437587  0.891773
2    4.0  0.963663  0.383442  0.791725  0.528895
3    6.0  0.568045  0.925597  0.071036  0.087129
4    8.0  0.020218  0.832620  0.778157  0.870012
5   10.0  0.978618  0.799159  0.461479  0.780529
6   12.0  0.118274  0.639921  0.143353  0.944669
7   14.0  0.521848  0.414662  0.264556  0.774234
8   16.0       NaN       NaN       NaN       NaN
9   18.0       NaN       NaN       NaN       NaN

Note: merge or other operations based on __eq__ on floats, e.g. np.setdiff1d, must be carried on with care.

fair point on the float angle. Might be safer for OP to use ints — sammywemmy, Dec 16 '21 at 12:02
Thanks for the excellence answer @Quang Hoang. Say the original `df=pd.DataFrame(arr[sorted(idx_available),:],columns=[('lapse',''),('a','i'),('a','j'),('b','k'),('c','')])`, simply `df.merge(pd.DataFrame({('lapse',''): np.arange(0,max_val,increment)}), how='right')` throw an `ValueError: cannot join with no overlapping index names`. Any tips? — mpx, Dec 16 '21 at 12:25

score 0 · Answer 3 · answered Dec 16 '21 at 12:20

Referencig crono's answer you can use the following tools from pandas:

Index - To create the complete list of lapses your dataframe is going to use. Index doc

set_index - Set your column lapse as index to facilitate the dataframe transformation. set_index doc

reindex - The main part, to merge your dataframe with the complete list o lapses, reindex will automatically fill the missing rows with NaN and keep the ones that already exist. reindex doc

reset_index - Will return your lapse to a normal column instead of index. reset_index doc

Create surrogate rows in Pandas based on missing condition

3 Answers3