
I have a pandas DataFrame in which one column holds dates and another holds values interspersed with NaNs. An example DataFrame is given below:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': pd.date_range(start="01/01/2013", periods=6, freq="D"), 'b': [1, np.nan, np.nan, 4, np.nan, 7]})
df
 a           b
 2013-01-01  1
 2013-01-02  NaN
 2013-01-03  NaN
 2013-01-04  4
 2013-01-05  NaN
 2013-01-06  7

Now I want to fill these NaNs with a fixed multiple of the previous value. For example, with a multiplier of 0.5, the DataFrame above would look like this after the transformation:

  a           b
 2013-01-01   1 
 2013-01-02   0.5
 2013-01-03   0.25
 2013-01-04   4
 2013-01-05   2
 2013-01-06  7

One way of doing this is to loop over b and use .loc to fill each value, as in the snippet below:

for i in range(df.shape[0]):
    if np.isnan(df.loc[i, 'b']):
        df.loc[i, 'b'] = df.loc[i-1, 'b']*0.5

Though this works, it does not scale well and takes a lot of time. It would be really helpful if someone could show how to do this more efficiently using pandas built-in functions.

Thanks!

Pawan

2 Answers


I don't think you can do this in a fully vectorised manner, as your loop relies on the df being modified in place whilst looping. You can, however, limit the operation to just the rows that contain NaN:

In [32]:
while df['b'].isnull().any():
    df.loc[df['b'].isnull(), 'b'] = df['b'].shift() / 2
df

Out[32]:
           a     b
0 2013-01-01  1.00
1 2013-01-02  0.50
2 2013-01-03  0.25
3 2013-01-04  4.00
4 2013-01-05  2.00
5 2013-01-06  7.00
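
Note that the `/2` above hard-codes the question's multiplier of 0.5; a minimal sketch of the same loop with the multiplier pulled out into a variable (`multiplier` is just an assumed name):

multiplier = 0.5
while df['b'].isnull().any():
    # each pass fills the NaNs that directly follow an already-filled value
    df.loc[df['b'].isnull(), 'b'] = df['b'].shift() * multiplier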

timings

In [36]:
%%timeit
for i in range(df.shape[0]):
    if np.isnan(df.loc[i, 'b']):
        df.loc[i, 'b'] = df.loc[i-1, 'b']*0.5
1000 loops, best of 3: 1.27 ms per loop

In [40]:
%%timeit
while df['b'].isnull().any():
    df.loc[df['b'].isnull(), 'b'] = df['b'].shift() / 2

10000 loops, best of 3: 91 µs per loop
EdChum
  • Just checked, this one is quite fast even for a loopy solution. Nice! With my NumPy-based one, the bottleneck seems to be the last step, where it updates the dataframe column with the NumPy-produced result. – Divakar Feb 15 '16 at 13:48

There is a dependency here that can be traced with a combination of diff + cumprod for a vectorized solution. The philosophy is very similar to a diff + cumsum based solution to a MATLAB problem discussed in great detail here; the only change is that we are dealing with multiplication instead of summation. The implementation would look something like this:

# Input param
multiplier = 0.5

# Store reciprocal of multiplier
reci_multiplier = 1.0/multiplier

# Store the relevant column as a numpy array for further computations
df_arr = np.array(df['b'])

# Get nan mask for the input column
nan_mask = np.isnan(df_arr)

# Indices where the "shifts" occur (basically the non-NaN positions)
idx = np.where(~nan_mask)[0]

# Set nan positions with multiplier 
df_arr[nan_mask] = multiplier

# Scale non-NaN positions with "position-based multipliers", which, when
# "cumprod-ed", result in our desired output. Update the column in df.
df_arr[idx[1:]] /= df_arr[idx[:-1]]*(reci_multiplier**(-(np.diff(idx)-1)))
df['b'] = df_arr.cumprod()

Sample input and output:

Input dataframe: 
            a   b
0  2013-01-01   5
1  2013-01-02 NaN
2  2013-01-03 NaN
3  2013-01-04   4
4  2013-01-05 NaN
5  2013-01-06 NaN
6  2013-01-07 NaN
7  2013-01-08 NaN
8  2013-01-09   8
9  2013-01-10   3
10 2013-01-11 NaN
11 2013-01-12  11

multiplier = 0.4   # Note that this is different from the original 0.5
                   # to show some variety with the application of solution 
Output dataframe: 
            a        b
0  2013-01-01   5.0000
1  2013-01-02   2.0000
2  2013-01-03   0.8000
3  2013-01-04   4.0000
4  2013-01-05   1.6000
5  2013-01-06   0.6400
6  2013-01-07   0.2560
7  2013-01-08   0.1024
8  2013-01-09   8.0000
9  2013-01-10   3.0000
10 2013-01-11   1.2000
11 2013-01-12  11.0000
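
For reuse, the steps above can be wrapped into a small helper. This is only a sketch (the function name `fill_nan_with_multiplier` is made up here, and it relies on the `np`/`pd` imports from the question), followed by how it would be called:

def fill_nan_with_multiplier(s, multiplier):
    # Vectorized fill: every NaN becomes multiplier * the previous (filled) value,
    # using the same diff + cumprod idea as above.
    # Note: like the snippet above, this assumes s starts with a non-NaN value.
    arr = np.array(s, dtype=float)
    nan_mask = np.isnan(arr)
    idx = np.where(~nan_mask)[0]   # non-NaN positions
    arr[nan_mask] = multiplier     # each NaN slot contributes one multiplier factor
    # Rescale the non-NaN positions so that a single cumprod restores them
    arr[idx[1:]] /= arr[idx[:-1]] * multiplier**(np.diff(idx) - 1)
    return pd.Series(arr.cumprod(), index=s.index)

df['b'] = fill_nan_with_multiplier(df['b'], 0.5)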
Divakar
  • Should the 3rd row be 1.000 and the 5th row be 2.000? – EdChum Feb 15 '16 at 10:46
  • @EdChum Well in my sample input, output I am using a multiplier of `0.4` just to show some variety there. – Divakar Feb 15 '16 at 10:47
  • I think this is the more correct solution; mine just loops until the condition is met, so it looks cleaner, but I'm not sure it will scale well if you have a lot of sequential `NaN`s +1 – EdChum Feb 15 '16 at 13:49
  • @EdChum Well, if there are a number of sequential NaNs, that means there are fewer elements in `idx`, since `idx = np.where(~nan_mask)[0]`. The next step uses `idx` to set elements, so that means less indexing, and all the other steps aren't affected by the number of `NaN`s. So I would think the performance would actually be better in that case. – Divakar Feb 15 '16 at 14:01
  • I was talking about my solution, not yours; in mine `df['b'].shift() / 2` will only fill a single `NaN` row at a time, so if there are more sequential `NaN`s it has to loop more – EdChum Feb 15 '16 at 14:02
  • @EdChum Ah yes, that's true I guess. Regarding my solution, if I make a new column in the dataframe, like `df['c'] = df_arr.cumprod()`, instead of updating the existing column `df['b']`, surprisingly I see a good performance boost. Do you happen to know if this boost is expected? I don't have much experience with pandas. – Divakar Feb 15 '16 at 14:06
  • Not sure; it could be that it's less work to assign a NumPy memory block to a new column than to overwrite an existing column, where it has to delete, allocate a new block and copy the data, but I'm guessing here – EdChum Feb 15 '16 at 14:11