0

I have a data table where missing values (both single and consecutive) can occur within each group. I would like to fill them in as follows: calculate the average of the 3 values to the left of the first NaN in the sequence, then calculate the average of the 3 values to the right of the last NaN in the sequence, and then interpolate the NaNs between these averages.

+-------+-------+
| group | value |
+-------+-------+
| 1     | 1     |
+-------+-------+
| 1     | 1     |
+-------+-------+
| 1     | 2     |
+-------+-------+
| 1     | 3     |
+-------+-------+
| 1     | 4     |
+-------+-------+
| 1     | NaN   |
+-------+-------+
| 1     | NaN   |
+-------+-------+
| 1     | 3     |
+-------+-------+
| 1     | 6     |
+-------+-------+
| 1     | 4     |
+-------+-------+
| 1     | 3     |
+-------+-------+
| 1     | NaN   |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | 1     |
+-------+-------+
| 2     | 2     |
+-------+-------+
| 2     | 3     |
+-------+-------+
| 2     | 4     |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | 6     |
+-------+-------+
| 2     | 8     |
+-------+-------+
| 2     | 9     |
+-------+-------+

code to reproduce dataframe above

nan = np.nan
d = {'group': {0: 1,
  1: 1,
  2: 1,
  3: 1,
  4: 1,
  5: 1,
  6: 1,
  7: 1,
  8: 1,
  9: 1,
  10: 1,
  11: 1,
  12: 2,
  13: 2,
  14: 2,
  15: 2,
  16: 2,
  17: 2,
  18: 2,
  19: 2,
  20: 2,
  21: 2,
  22: 2,
  23: 2},
 'value': {0: 1.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: nan,
  6: nan,
  7: 3.0,
  8: 6.0,
  9: 4.0,
  10: 3.0,
  11: nan,
  12: nan,
  13: nan,
  14: 1.0,
  15: 2.0,
  16: 3.0,
  17: 4.0,
  18: nan,
  19: nan,
  20: nan,
  21: 6.0,
  22: 8.0,
  23: 9.0}}

df = pd.DataFrame(d)

Expected output:

d = {'group': {0: 1,
  1: 1,
  2: 1,
  3: 1,
  4: 1,
  5: 1,
  6: 1,
  7: 1,
  8: 1,
  9: 1,
  10: 1,
  11: 1,
  12: 2,
  13: 2,
  14: 2,
  15: 2,
  16: 2,
  17: 2,
  18: 2,
  19: 2,
  20: 2,
  21: 2,
  22: 2,
  23: 2},
 'value': {0: 1.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: 3.44444444,
  6: 3.88888889,
  7: 3.0,
  8: 6.0,
  9: 4.0,
  10: 3.0,
  11: 4.333333,
  12: 2.0,
  13: 2.0,
  14: 1.0,
  15: 2.0,
  16: 3.0,
  17: 4.0,
  18: 4.166667,
  19: 5.333333,
  20: 6.500000,
  21: 6.0,
  22: 8.0,
  23: 9.0}}

Is it possible to do this in pandas, without using a loop?

Fissium
  • 321
  • 2
  • 11
  • You can check the interpolate function options in pandas and groupby function together. You might need this as well: https://stackoverflow.com/questions/37057187/pandas-interpolate-within-a-groupby –  Jun 30 '21 at 14:00

1 Answers1

1

IIUC, here’s one way:

df['updated_values'] = (
    df.groupby('group')
    .apply(
        lambda x: x['value'].fillna(
            x['value']
            .rolling(3)
            .mean()
            .bfill()
            .where(~x['value'].isna())
            .interpolate()
            .bfill()
            .ffill()
            )
    ).values
)

OUTPUT:

    group  value  updated_values
0       1    1.0        1.000000
1       1    1.0        1.000000
2       1    2.0        2.000000
3       1    3.0        3.000000
4       1    4.0        4.000000
5       1    NaN        3.444444
6       1    NaN        3.888889
7       1    3.0        3.000000
8       1    6.0        6.000000
9       1    4.0        4.000000
10      1    3.0        3.000000
11      1    NaN        4.333333
12      2    NaN        2.000000
13      2    NaN        2.000000
14      2    1.0        1.000000
15      2    2.0        2.000000
16      2    3.0        3.000000
17      2    4.0        4.000000
18      2    NaN        4.166667
19      2    NaN        5.333333
20      2    NaN        6.500000
21      2    6.0        6.000000
22      2    8.0        8.000000
23      2    9.0        9.000000
Nk03
  • 14,699
  • 2
  • 8
  • 22
  • 1
    The first method does not fill in the missing values if they are at the beginning of the group, for example, group 2 has two NaNs at the beginning, and they also remain. In this case, they need to be replaced by the average of <=3 on the right. – Fissium Jun 30 '21 at 14:25
  • 1
    @Rabbit It'll be helpful if you can also add the expected ouput. – Nk03 Jun 30 '21 at 14:27