I am currently cleaning a rather large file of time series. As you can see in the dataframe below, the majority of columns begin and end with some NaNs.

import pandas as pd
import numpy as np

# np.nan is used instead of np.NaN, since the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({
               'a': [np.nan, np.nan, 3, 4, 5, 3, 2, 1, 2, 1, np.nan, np.nan],
               'b': [np.nan, 80, 84, 30, 3, np.nan, np.nan, np.nan, 4, 3, 2, 1],
               'c': [np.nan, np.nan, np.nan, np.nan, np.nan, 80, 84, 1, 2, 3, 4, 5],
               'd': [np.nan, 40, 8, 2, 3, 4, 5, 6, 7, 8, 7, np.nan],
               'e': [np.nan, 1, 2, 3, np.nan, np.nan, 6, 7, 8, 9, 1, 2]})

That in itself isn't a problem for me. However, if there is a run of two or more consecutive NaNs INSIDE a column, i.e. between its first and last real values, I need to detect it. So what I want to do is very similar to the idea behind the interpolation method:

df = df.interpolate(method = 'linear', limit_area='inside', limit = 2)       

But instead of filling the gaps, I would like to get the columns in which these NaN runs occur. Looking at the example dataframe above, I would like columns "b" and "e" as the output, because those are the only columns that have two or more consecutive NaNs IN BETWEEN the first and last real, non-NaN values. So I am not looking for an interpolation method, but for a method of detecting these columns.

Does anyone have a recommendation for how this could be done? Thanks in advance.

Sanoj

1 Answer

This can be solved using a method derived from the consecutive count problem.

Here I would define a function that counts the maximum number of consecutive NaNs between the first and last valid values of a Series:

def seqnan(x):
    y = x[~x.isna()]
    y = x[y.index[0]:y.index[-1]]      # limit from first non NaN value to last one
    # the magic formula (see ref. post for details)
    t = y[y.isna()].groupby((y.isna()&(~y.shift().isna())).cumsum()).cumcount().max()
    return 0 if np.isnan(t) else t+1
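
To make the one-liner easier to follow, here is a minimal sketch that spells out its intermediate steps on column "b", already trimmed to the span between its first and last valid values. The names is_nan, run_start, run_id and positions are illustrative only and do not appear in the function above.

```python
import numpy as np
import pandas as pd

# Column 'b' from the question, trimmed to first..last valid value
y = pd.Series([80, 84, 30, 3, np.nan, np.nan, np.nan, 4, 3, 2, 1], dtype=float)

is_nan = y.isna()
# True only where a NaN run begins: this value is NaN, the previous one is not
run_start = is_nan & ~y.shift().isna()
# Cumulative count of run starts, so all members of one NaN run share an id
run_id = run_start.cumsum()
# Position of each NaN inside its own run: 0, 1, 2, ...
positions = y[is_nan].groupby(run_id[is_nan]).cumcount()
# The longest run is the largest position plus one (0 if there is no NaN)
longest = 0 if positions.empty else int(positions.max()) + 1
print(longest)
```

The largest position within any run, plus one, is the length of the longest run, which is why the function returns t+1.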

Now we have:

>>> df.apply(seqnan)
a    0
b    3
c    0
d    0
e    2
dtype: int64

So to get the names of the columns that have at least 2 consecutive NaN values between their first and last non-NaN values, you can do:

tmp = df.apply(seqnan)
cols = tmp[tmp >= 2].index.tolist()

which gives the expected result:

['b', 'e']
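
As a variation, the same idea could be sketched with first_valid_index and last_valid_index, which also copes with a column that contains no valid value at all (seqnan would raise an IndexError there). The helper name longest_inner_nan_run is hypothetical, and the sketch assumes the index labels are unique so that the .loc slice is well defined.

```python
import numpy as np
import pandas as pd

def longest_inner_nan_run(col: pd.Series) -> int:
    """Length of the longest NaN run between the first and last valid values."""
    first, last = col.first_valid_index(), col.last_valid_index()
    if first is None:  # the column holds no valid value at all
        return 0
    inner = col.loc[first:last].isna()
    # A new run starts wherever the NaN flag differs from the previous row
    run_id = (inner != inner.shift()).cumsum()
    # Summing the flags per run gives the run length for NaN runs, 0 otherwise
    return int(inner.groupby(run_id).sum().max())

df = pd.DataFrame({
    'a': [np.nan, np.nan, 3, 4, 5, 3, 2, 1, 2, 1, np.nan, np.nan],
    'b': [np.nan, 80, 84, 30, 3, np.nan, np.nan, np.nan, 4, 3, 2, 1],
    'c': [np.nan, np.nan, np.nan, np.nan, np.nan, 80, 84, 1, 2, 3, 4, 5],
    'd': [np.nan, 40, 8, 2, 3, 4, 5, 6, 7, 8, 7, np.nan],
    'e': [np.nan, 1, 2, 3, np.nan, np.nan, 6, 7, 8, 9, 1, 2]})

result = df.apply(longest_inner_nan_run)
cols = result[result >= 2].index.tolist()
print(cols)
```

Changing the threshold in result >= 2 adjusts how long a gap must be before a column is reported.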
Serge Ballesta