Fill nan gaps in pandas df only if gaps smaller than N nans

Question

I am working with a pandas data frame that contains also nan values. I want to substitute the nans with interpolated values with df.interpolate, but only if the length of the sequence of nan values is =<N. As an example, let's assume that I choose N = 2 (so I want to fill in sequences of nans if they are up to 2 nans long) and I have a dataframe with

print(df)
A   B   C
1   1   1
nan nan 2
nan nan 3
nan 4   nan
5   5   5

In such a case I want to apply a function on df that only the nan sequences with length N<=2 get filled, but the larger sequences get untouched, resulting in my desired output of

print(df)
A   B   C
1   1   1
nan 2   2
nan 3   3
nan 4   4
5   5   5

Note that I am aware of the option of limit=N inside df.interpolate, but it doesn't fulfil what I want, because it would fill any length of nan sequence, just limit the filling to a the first 3 nans resulting in the undesired output

print(df)
A   B   C
1   1   1
2   2   2
3   3   3
nan 4   4
5   5   5

So do you know of a function/ do you know how to construct a code that results in my desired output? Tnx

You dont need any apply methods. Simply find the column names which agree with your condition and then use `df[cols] = df[cols].interpolate()` to interpolate and overwrite them. Check my 2 liner for reference. — Akshay Sehgal, Sep 12 '21 at 21:16

d.b · Accepted Answer · 2021-09-13T14:42:17.353

3

You can perform run length encoding and identify the runs of NaN that are shorter than or equal to two elements for each columns. One way to do that is to use get_id from package pdrle (disclaimer: I wrote it).

import pdrle


chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
#      A    B    C
# 0  1.0  1.0  1.0
# 1  NaN  2.0  2.0
# 2  NaN  3.0  3.0
# 3  NaN  4.0  4.0
# 4  5.0  5.0  5.0

edited Sep 13 '21 at 14:42

answered Sep 12 '21 at 21:13

d.b

32,245
6
36
77

Awesome, I like how it is short and it doesn't use explicit iteration in a for-loop. But can you tell a little bit about the `pdrle` package and in a 1-2 sentences what is happening in the one line of code `chk = ....`? Afterwards I can accept this as the answer to my question :) thanks – NeStack Sep 13 '21 at 10:12
Also, the code runs quite slow - ~10 min for my 130000 rows in 10 columns. I think it is due to the `.groupby`. Do you think there is a way to speed up the code? tnx – NeStack Sep 13 '21 at 10:59
Would `cython` or `numba` have an effect on the `.groupby` function applied to a dataframe? Can you give an example how this amendment should look like? – NeStack Sep 13 '21 at 15:16

Andrej Kesely · Answer 2 · 2021-09-12T20:49:14.397

Try:

N = 2
df_interpolated = df.interpolate()

for c in df:
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    df_interpolated[c] = df_interpolated.loc[~x, c]

print(df_interpolated)

Prints:

     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0

Trying with different df:

     A    B    C
0  1.0  1.0  1.0
1  NaN  NaN  2.0
2  NaN  NaN  3.0
3  NaN  4.0  NaN
4  5.0  5.0  5.0
5  NaN  5.0  NaN
6  NaN  5.0  NaN
7  8.0  5.0  NaN

produces:

     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0
5  6.0  5.0  NaN
6  7.0  5.0  NaN
7  8.0  5.0  NaN

Akshay Sehgal · Answer 3 · 2021-09-12T21:18:40.613

0

You can try the following -

n=2
cols = df.columns[df.isna().sum()<=n]
df[cols]  = df[cols].interpolate()
df

     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0

df.columns[df.isna().sum()<=n] filters the columns based on your condition. Then, you simply overwrite the columns after interpolation.

edited Sep 12 '21 at 21:18

answered Sep 12 '21 at 21:13

Akshay Sehgal

18,741
3
21
51

Thanks! But my actual `df` is different from the dummy I show as an example - the real `df` is ~130 000 rows long and each column contains nan sequences of various length, some longer than N=2, some shorter. And I want in all the columns to fill sequences up to N long, so your answer will not do it. But still thanks :) – NeStack Sep 13 '21 at 09:55

Fill nan gaps in pandas df only if gaps smaller than N nans

3 Answers3

Linked