I am currently cleaning a rather large file of time series. As you can see in the dataframe below, the majority of columns begin and end with some NaNs.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a': [np.NaN, np.NaN, 3, 4, 5, 3, 2, 1, 2, 1, np.NaN, np.NaN],
'b': [np.NaN, 80, 84, 30, 3, np.NaN, np.NaN, np.NaN, 4, 3, 2, 1],
'c': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, 80, 84, 1, 2, 3, 4 , 5],
'd': [np.NaN, 40, 8, 2, 3, 4, 5, 6, 7, 8, 7, np.NaN],
'e': [np.NaN, 1, 2, 3, np.NaN, np.NaN, 6, 7, 8, 9, 1 ,2]})
This isn't a problem for me, but: If there is a series of 2 following NaN's INSIDE the first real value and the last value of a column, I need to detect them. So what I want to do is very similiar to the idea / thinking behind the interpolation method:
df = df.interpolate(method = 'linear', limit_area='inside', limit = 2)
But instead I would like to get the columns in which there are these recurring NaN series. So looking at the example dataframe above, I would like column "b" and "e" as the output, because those are the only columns who have more than two recurring NaNs INBETWEEN the first and last real, non-NaN value. So I am not looking for a interpolation method, but a method for detecting these columns
Does anyone have a recommendation how this could be done? Thanks in advance