
I am trying to impute/fill values using rows with similar columns' values.

For example, I have this dataframe:

one | two | three
  1 |   1 |  10
  1 |   1 |  nan
  1 |   1 |  nan
  1 |   2 |  nan
  1 |   2 |  20
  1 |   2 |  nan
  1 |   3 |  nan
  1 |   3 |  nan

I want to use columns 'one' and 'two' as group keys: whenever a group has at least one non-NaN value in column 'three', fill the group's missing entries with that existing value.

Here is my desired result:

one | two | three
  1 |   1 |  10
  1 |   1 |  10
  1 |   1 |  10
  1 |   2 |  20
  1 |   2 |  20
  1 |   2 |  20
  1 |   3 |  nan
  1 |   3 |  nan

You can see that the group with keys 1 and 3 stays NaN, because there is no existing value to impute from.

I have tried using groupby+fillna():

df['three'] = df.groupby(['one','two'])['three'].fillna()

which gave me an error.

I have also tried a forward fill, which gives a rather strange result where it forward-fills column 'two' instead. I am using this code for the forward fill:

df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
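For reference, the sample frame above can be reproduced like this (a sketch; column names and values taken from the question):

```python
import numpy as np
import pandas as pd

# Build the example frame from the question; NaN marks the missing values
df = pd.DataFrame({
    'one':   [1, 1, 1, 1, 1, 1, 1, 1],
    'two':   [1, 1, 1, 2, 2, 2, 3, 3],
    'three': [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})
```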
smci
Phurich.P

2 Answers

If there is at most one non-NaN value per group, use ffill (forward fill) and bfill (backward fill) per group; since both fills must stay inside each group, use apply with a lambda:

df['three'] = (df.groupby(['one','two'], sort=False)['three']
                 .apply(lambda x: x.ffill().bfill()))
print (df)
   one  two  three
0    1    1   10.0
1    1    1   10.0
2    1    1   10.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
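The same per-group fill can also be written with `transform`, which always returns a result aligned to the original index, so the assignment back is unambiguous (a sketch, reusing the question's sample frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [1, 1, 1, 1, 1, 1, 1, 1],
    'two':   [1, 1, 1, 2, 2, 2, 3, 3],
    'three': [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# transform keeps the original index, so the result assigns back cleanly;
# the lambda fills forward then backward within each (one, two) group
df['three'] = (df.groupby(['one', 'two'], sort=False)['three']
                 .transform(lambda x: x.ffill().bfill()))
```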

But if there are multiple values per group and NaN should be replaced by a per-group constant - e.g. the group mean:

print (df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1    NaN
3    1    2    NaN
4    1    2   20.0
5    1    2    NaN
6    1    3    NaN
7    1    3    NaN

df['three'] = (df.groupby(['one','two'], sort=False)['three']
                 .apply(lambda x: x.fillna(x.mean())))
print (df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1   25.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
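The group-mean fill can likewise avoid `apply`: broadcast each group's mean back to the original index with `transform('mean')`, then feed it to `fillna` (a sketch, using the second sample frame above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [1, 1, 1, 1, 1, 1, 1, 1],
    'two':   [1, 1, 1, 2, 2, 2, 3, 3],
    'three': [10, 40, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# transform('mean') returns each group's mean repeated for every row of
# that group, aligned to df's index, so fillna can use it directly
group_mean = df.groupby(['one', 'two'])['three'].transform('mean')
df['three'] = df['three'].fillna(group_mean)
```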
jezrael
  • That was exactly what I needed. Thank you! – Walt Reed Nov 08 '17 at 16:06
  • 2
    @jezrael: is there any reason that force to use `apply` in your answer? I am asking because I tried direct `ffill` and `bfill` and it returns correct result: `df['three'] = df.groupby(['one', 'two'])['three'].ffill().bfill()` – Andy L. Apr 22 '19 at 22:48
  • 3
    @Andy L. It working correct, because last group is only NaN group. If change sample data for first only NaN group (10 to NaN) , your solution failed. Reason is last bfill working not per groups, but per Series returned groupby +ffill. – jezrael Apr 23 '19 at 03:57
  • 1
    ah, I forgot that the `bfill` back-fills the output series from `ffill`, not the `groupby`. Thanks for answers – Andy L. Apr 23 '19 at 04:34
  • 1
    May I ask, how can I apply `df['three'] = df.groupby(['one','two'], sort=False)['three'].apply(lambda x: x.ffill().bfill())` to multiple columns `three, four, five, etc` instead of only `three` which need groupby `one` and `two` and fillna? – ah bon Dec 23 '19 at 08:17
  • 2
    @ahbon - Use `cols = ['three','four','five']` and `df[cols] = df.groupby(['one','two'], sort=False)[cols].apply(lambda x: x.ffill().bfill())` – jezrael Dec 23 '19 at 08:20
  • This solution may not work as intended if the grouping columns include `NaN` values. How to fix it? One idea is to `df['four'] = df.groupby(['one','two'], sort=False)['three'].apply(lambda x: x.ffill().bfill())` and then `df.loc[df.four.isnull(),'four']=df.three` Is there a better way? – volkan g May 30 '22 at 12:29
  • @volkang - `This solution may not work as intended if the grouping columns include NaN values` - do you mean columns `one, two`? Because if you mean `three`, it works well, see the answer - for the last group `1,3`, which is all NaNs, the output is `NaN`s. – jezrael May 30 '22 at 12:33
  • Yes I meant `one` or `two` – volkan g May 30 '22 at 12:35
  • @volkang - check [this](https://stackoverflow.com/questions/18429491/pandas-groupby-columns-with-nan-missing-values) – jezrael May 30 '22 at 12:36
  • I tried adding `dropna=False` in the grouper, but seems not to work. Could you try on this `df` for example? `data = { 'one': [1,1,1,1,1,1,1,1,np.nan,99], 'two': [1,1,1,2,2,2,3,3,99,np.nan], 'three':[10,np.nan,np.nan,np.nan,20,np.nan,np.nan,np.nan,999,888] } df=pd.DataFrame(data)` – volkan g May 30 '22 at 12:41
  • @volkang - I guess your pandas version is <1.1, so you need [this](https://stackoverflow.com/a/39857303/2901002) solution. – jezrael May 30 '22 at 12:43
  • @jezrael Nope. My pandas version is '1.3.1'. However, I tried changing the key columns to `string` and using `dropna=False`, and with that it works. Thanks. Do you want to add this to your answer to make it more comprehensive? – volkan g May 30 '22 at 12:52
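Regarding the NaN-key discussion in the comments above: by default pandas drops rows whose group keys are NaN, so those rows are never filled. One hedged workaround is to replace missing keys with a sentinel before grouping (a sketch using volkan g's sample data; `-1` is assumed not to occur as a real key):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [1, 1, 1, 1, 1, 1, 1, 1, np.nan, 99],
    'two':   [1, 1, 1, 2, 2, 2, 3, 3, 99, np.nan],
    'three': [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan, 999, 888],
})

# Replace NaN keys with a sentinel so those rows still form groups
# instead of being silently dropped from the groupby
keys = df[['one', 'two']].fillna(-1)
df['three'] = (df.groupby([keys['one'], keys['two']], sort=False)['three']
                 .transform(lambda x: x.ffill().bfill()))
```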

You can sort the data by the column with missing values, then group and forward-fill:

df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
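Note that `sort_values` reorders the rows; if the original order matters, one way (a sketch, reusing the question's sample frame) is to restore it with `sort_index` afterwards:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [1, 1, 1, 1, 1, 1, 1, 1],
    'two':   [1, 1, 1, 2, 2, 2, 3, 3],
    'three': [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# Sorting moves each group's non-NaN value ahead of its NaNs,
# so a plain forward fill per group is enough ...
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one', 'two'])['three'].ffill()
# ... and sort_index puts the rows back in their original order
df.sort_index(inplace=True)
```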
Mykola Zotko