Find the missing values in data frame using python

Question

     A1   A2  A3   B1   B2   B3
aa  1.0  4.0   9  NaN  NaN  2.0
bb  NaN  NaN   9  9.0  2.0  2.0
cc  3.0  2.0   7  1.0  3.0  2.0
dd  4.0  5.0   7  NaN  4.0  5.0
ee  5.0  NaN   1  5.0  1.0  5.0
ff  3.0  5.0   2  5.0  2.0  NaN

I need to fill the missing values (Na) based on the following parameters:

if the row(aa, bb, cc, dd, ff) has a missing value for either of columns(A1, A2, A3 or B1, B2, B3), then fill the missing value with the mean of the values of the other two rows of either (A1-3 or B1-3) example: (dd:B1) should be (5+4)/2 = 4.5
if the rows have two missing values belong to either (A1-A3 or B1-B3) then fill it with the available value in the third column of the category(A or B) example: (bb:A1 and bb: A2) == 9 as (bb: A3 is 9)

Expected output:

     A1   A2   A3   B1   B2   B3
aa  1.0  4.0  9.0  2.0  2.0  2.0
bb  9.0  9.0  9.0  9.0  2.0  2.0
cc  3.0  2.0  7.0  1.0  3.0  2.0
dd  4.0  5.0  7.0  4.5  4.0  5.0
ee  5.0  3.0  1.0  5.0  1.0  5.0
ff  3.0  5.0  2.0  5.0  2.0  3.5

Could please add the output as code instead of image, followed by the desired output if posible? — Celius Stingher, Sep 16 '19 at 20:10
Please look into https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples on how to make good examples — niraj, Sep 16 '19 at 20:17

Scott Boston · Accepted Answer · 2019-09-17T14:15:06.737

Here is one-way:

df.replace({'Na':np.nan}).T\
  .groupby(df.columns.str[0], group_keys=False)\
  .apply(lambda x: x.fillna(x.mean())).T

Output:

     A1   A2   A3   B1   B2   B3
aa  1.0  4.0  9.0  2.0  2.0  2.0
bb  9.0  9.0  9.0  9.0  2.0  2.0
cc  3.0  2.0  7.0  1.0  3.0  2.0
dd  4.0  5.0  7.0  4.5  4.0  5.0
ee  5.0  3.0  1.0  5.0  1.0  5.0
ff  3.0  5.0  2.0  5.0  2.0  3.5

Details:

First, in the original dataframe, there were string 'Na' instead of np.nan. Use replace with a dictionary to change 'Na' ton np.Nan, we we can use fillna.

Transpose the dataframe such that columns become rows and rows becomes columns, it is easier and you have more functionality with grouping on rows instead of columns. Use, the first letter of the column header to group on by using the .str accessor and slicing position 0. Next, we ca use fillna a with the mean of the x values.

it will be quite helpful if you could explain the process as well — sp29, Sep 16 '19 at 21:33

ansev · Answer 2 · 2019-09-16T21:44:03.970

first you maybe need:

df=df.replace({'Na':np.nan})

now df is:

     A1   A2  A3   B1   B2   B3
aa  1.0  4.0   9  NaN  NaN  2.0
bb  NaN  NaN   9  9.0  2.0  2.0
cc  3.0  2.0   7  1.0  3.0  2.0
dd  4.0  5.0   7  NaN  4.0  5.0
ee  5.0  NaN   1  5.0  1.0  5.0
ff  3.0  5.0   2  5.0  2.0  NaN

Then you can use:

c=df.columns.str.contains('A')
df2=df.T.groupby(c).mean().T
df2.columns=['B','A']
df=pd.concat([df.T[c].fillna(df2['A']),df.T[~c].fillna(df2['B'])]).T
print(df)


     A1   A2   A3   B1   B2   B3
aa  1.0  4.0  9.0  2.0  2.0  2.0
bb  9.0  9.0  9.0  9.0  2.0  2.0
cc  3.0  2.0  7.0  1.0  3.0  2.0
dd  4.0  5.0  7.0  4.5  4.0  5.0
ee  5.0  3.0  1.0  5.0  1.0  5.0
ff  3.0  5.0  2.0  5.0  2.0  3.5

Find the missing values in data frame using python

2 Answers2