0

I want to use bfill on a pandas dataframe but I want the value to use for each backfill to be dependant on the values in the row.

Example input:

           type   val
2018-12-31    H     1
2019-03-31  NaN   NaN
2019-06-30    Q     2
2019-07-31  NaN   NaN
2019-08-31    H     3
2019-09-30    Y     4 
2019-12-31    Q     5

Expected output:

           type   val
2018-12-31    H     1
2019-03-31    Q     2  <-- Same as 2019-06-30
2019-06-30    Q     2
2019-07-31    Q     6  <-- Double 2019-08-31
2019-08-31    H     3
2019-09-30    Y     4
2019-12-31    Q     5

In this example, the backfilled value for 2019-07-31 is 6 because it has a H type, i.e. it's double the (2019-08-31, H) value. On the other hand, the backfilled value for 2019-03-31 is the same as the next row since that type is Q.

Rules:

  • Type H: double the value for backfill
  • Type Q and Y: keep the value for backfill
  • All types: Set type to Q

I could not find any straightforward built in way of doing this. I need to do this on a very large dataframe so speed is important to me, and it's why I can't loop.

Jokab
  • 2,939
  • 1
  • 15
  • 26

2 Answers2

2

First create mask for values multiple by 2 for rows with missing values and for backfilled H values:

m = df['type'].isna() & df['type'].bfill().eq('H')

Then back fill values of val and multiple by 2 by mask:

df['val'] = df['val'].bfill().mask(m, lambda x: x * 2)

Last replace missing value in type column:

df['type'] = df['type'].fillna('Q')
print (df)
           type  val
2018-12-31    H  1.0
2019-03-31    Q  2.0
2019-06-30    Q  2.0
2019-07-31    Q  6.0
2019-08-31    H  3.0
2019-09-30    Y  4.0
2019-12-31    Q  5.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

Here's one approach:

# generate a series m, which contains 2 where the
# next value is H
m = df.type.eq('H').shift(-1).add(1).fillna(1)
# Fill missing values in type with H
df.type.fillna('H', inplace = True)
# fillna in val and mulitply with m
df['val'] = df.val.fillna(df.val.shift(-1) * m)

print(df)

            type  val
2018-12-31    H  1.0
2019-03-31    H  2.0
2019-06-30    Q  2.0
2019-07-31    H  6.0
2019-08-31    H  3.0
2019-09-30    Y  4.0
2019-12-31    Q  5.0

Where:

print(m)

2018-12-31    1
2019-03-31    1
2019-06-30    1
2019-07-31    2
2019-08-31    1
2019-09-30    1
2019-12-31    1
Name: type, dtype: int64
yatu
  • 86,083
  • 12
  • 84
  • 139