3

I'm interested in forward filling both single and multiple values in a column in pandas. With the following dataframe:

import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN NaN   9 

Forward fill will produce:

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df.fillna(method='ffill')
df
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

However, I need an ffill-like method that will do this, or alternatively copy all above values if the values above follow one another:

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, 5, 9], [None,None,None])
df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN   5   9 
3 NaN NaN NaN

Resulting in:

df
    0   1   2
0   1   2   3
1   4   2   3
2   1   5   9 
3   4   5   9

Major edit: In my data the values will always be followed by NaNs in an unknown multiple of the length of values. Take df[0], for instance 1,4 would repeat for as long as there are NaNs. The only rule is that they will be a multiple of the length of the values (2)

  • how is maximal length of consecutive missing values in real data? – jezrael May 10 '22 at 10:51
  • @jezrael it's someone elses data structure the 'above' values allow me to reindex all the data so that it's comparable whereas relative values are provided usually (now NaNs) – mattmoore_bioinfo May 10 '22 at 10:57
  • 2
    is the pattern same for whole column? like in your example. will there be always same number of consecutive Nan's in one column and the other column only 1 Nan at a time ? or can these patterns mix in one column – Rabinzel May 10 '22 at 11:05
  • 1
    @Rabinzel they can mix in a column. The only rule is that the same number of NaNs will follow the number of non-NaNs above – mattmoore_bioinfo May 10 '22 at 11:11
  • Added a major edit to the above rule. The non-NaNs will be followed by a multiple of their length until another value is met. That's why I was thinking along the lines of an ffill-like method but later got it confused – mattmoore_bioinfo May 10 '22 at 12:34

1 Answers1

5

You can create consecutive values for missing and not missing values, then create counter per columns and forward filling missing values per groups:

df = pd.DataFrame([[1, 2, 3], [4, None, 8], [None, 5, 9], [None,None,10],
                   [0, 2, None], [5, None, None], [None, 5, None], [None,None,None]])


print (df)
     0    1     2
0  1.0  2.0   3.0
1  4.0  NaN   8.0
2  NaN  5.0   9.0
3  NaN  NaN  10.0
4  0.0  2.0   NaN
5  5.0  NaN   NaN
6  NaN  5.0   NaN
7  NaN  NaN   NaN

m = df.isna()
g = m.ne(m.shift()).cumsum()
for c in df.columns:
    df[c] = df.groupby(g.groupby(c).cumcount())[c].ffill()

print (df)
     0    1     2
0  1.0  2.0   3.0
1  4.0  2.0   8.0
2  1.0  5.0   9.0
3  4.0  5.0  10.0
4  0.0  2.0   3.0
5  5.0  2.0   8.0
6  0.0  5.0   9.0
7  5.0  5.0  10.0

EDIT: New solution repeat non missing values by newxt missing values per groups creted by first non missing value, here is used numpy.tile for generate sequences:

df = pd.DataFrame([[1, 2, 3], [4, None, 8], [None, 5, 9], [7,None,10],
                   [0, 2, None], [5, None, None], [None, 6, None], [None,8,None]
                   , [None,None,None], [None,None,None]])
print (df)
     0    1     2
0  1.0  2.0   3.0
1  4.0  NaN   8.0
2  NaN  5.0   9.0
3  7.0  NaN  10.0
4  0.0  2.0   NaN
5  5.0  NaN   NaN
6  NaN  6.0   NaN
7  NaN  8.0   NaN
8  NaN  NaN   NaN
9  NaN  NaN   NaN

g = (df.notna() & df.shift().isna()).cumsum()

def f(x):
    non_miss = x.dropna()
    return np.tile(non_miss, int(len(x) // len(non_miss) + 2))[:len(x)]

df = df.apply(lambda x: x.groupby(g[x.name]).transform(f))
print (df)
     0    1     2
0  1.0  2.0   3.0
1  4.0  2.0   8.0
2  1.0  5.0   9.0
3  7.0  5.0  10.0
4  0.0  2.0   3.0
5  5.0  2.0   8.0
6  7.0  6.0   9.0
7  0.0  8.0  10.0
8  5.0  6.0   3.0
9  7.0  8.0   8.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This is brilliant, thank you! I've made an edit to the question now to clarify: Major edit: In my data the values will always be followed by NaNs in an unknown multiple of the length of values. Take df[0], for instance 1,4 would repeat for as long as there are NaNs. The only rule is that they will be a multiple of the length of the values (2) I made a mistake before in thinking that the NaNs would always equal the length of the preceding values, it is rather that they will be an unknown multiple of the preceding values until another value is met – mattmoore_bioinfo May 10 '22 at 12:33