2

I'm trying to paper over missing data in a dataframe by grouping on one column and then flood-filling (bfill().ffill()) subsets of columns inside the groups.

I was previously using

def ffbf(x):
   return x.ffill().bfill()

df[some_cols] = df.groupby(group_key)[some_cols].transform(ffbf)

but transform becomes unbelievably slow even on relatively small dataframes (already several seconds for only 3000x20), so I wanted to see if I could apply ffill and bfill directly to the groups since they're supposed to be cythonized now.

Am I correct in thinking that I need to invoke groupby again in between ffill and bfill because neither method preserves the groupings?

Right now I have

df[some_cols] = df[some_cols].groupby(group_key).ffill().groupby(group_key).bfill()

and I think that it's doing what I want, and it's waaaaaaayyy faster than using transform, but I'm not experienced enough with pandas to be certain, so I figured I'd ask.

[edit] It looks like this change is jumbling my data. Why?

ackermanj
  • 23
  • 4
  • You should have asked after spending enough time on it, because that looks fine as is. – Kenan Jul 31 '18 at 03:12
  • It's not fine at all. I see now that the data has been getting jumbled up by this change, but I don't know why. – ackermanj Jul 31 '18 at 04:33

1 Answers1

3

I my opinion here is necessary another groupby with bfill for avoid replace NaNs for only NaNs group from another one.

For performance is used this code:

In [205]: %timeit df1[some_cols] = df1.groupby(group_key)[some_cols].transform(ffbf)
443 ms ± 7.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [206]: %timeit df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(group_key).ffill().groupby(group_key).bfill()
5.69 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.random.seed(785)

N = 10000
df = pd.DataFrame({'key':np.random.randint(1000, size=N),
                   'A':np.random.choice([1,2,np.nan], size=N),
                   'B':np.random.choice([1,4,np.nan], size=N),
                   'C':np.random.choice([7,0,np.nan], size=N),
                   'D':np.random.choice([7,0,8], size=N)})

df = df.sort_values('key')
print (df)

def ffbf(x):
   return x.ffill().bfill()

group_key = 'key'
some_cols = ['A','B','C']
df1 = df.copy()
df1[some_cols] = df1.groupby(group_key)[some_cols].transform(ffbf)

#a bit chamgef solution for working in pandas 0.23.1
df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(group_key).ffill().groupby(group_key).bfill()

print (df.equals(df1))
True

EDIT: In next pandas versions (test pandas 1.1.1) is possible use:

df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(df[group_key]).ffill().groupby(df[group_key]).bfill()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • how to Remove NaN from the cell array without dropping the entire row, in a single line for a dataframe? – Pyd Jul 31 '18 at 09:15
  • 1
    @pyd - You can only shift all values like [this solution](https://stackoverflow.com/q/43119503/2901002), because need always same DataFrame structure. (only change axis=1) to `apply`. last is possible remove only NaNs rows by `df = df.dropna(how='all')` – jezrael Jul 31 '18 at 09:18
  • @jezrael Do you know why using `df[[group_key] + some_cols]` in your answer gives a different result than `df[some_cols + [group_key]]`? It looks like the order matters, but I don't understand why. – ackermanj Jul 31 '18 at 14:38
  • @ackermanj In my opinion it is awfull bug, universal solution should be with join - `df = df.drop(some_cols + [group_key], axis=1).join(df[some_cols + [group_key]].groupby(group_key).ffill().groupby(group_key).bfill()).reindex(columns=df.columns)` – jezrael Aug 01 '18 at 04:28
  • This no longer works, due to this regression in functionality: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#dataframe-groupby-ffill-bfill-no-longer-return-group-labels – Isaac Sep 24 '20 at 08:57
  • @Isaac - need `df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(df[group_key]).ffill().groupby(df[group_key]).bfill()` – jezrael Sep 24 '20 at 09:10