
I have a pandas DataFrame in which one column holds dates and another holds values interspersed with NaNs. An example DataFrame is given below:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': pd.date_range(start="01/01/2013", periods=6, freq="D"), 'b': [1, np.nan, np.nan, 4, np.nan, 7]})
df
 a           b
 2013-01-01  1
 2013-01-02  NaN
 2013-01-03  NaN
 2013-01-04  4
 2013-01-05  NaN
 2013-01-06  7

Now I want to fill these NaNs with a fixed multiple of the previous value. For example, with a multiplier of 0.5, the DataFrame above would look like this after the transformation:

  a           b
 2013-01-01   1 
 2013-01-02   0.5
 2013-01-03   0.25
 2013-01-04   4
 2013-01-05   2
 2013-01-06  7

One way of doing this is to loop over b and use .loc to fill each value, as in the snippet below:

for i in range(df.shape[0]):
    if np.isnan(df.loc[i, 'b']):
        df.loc[i, 'b'] = df.loc[i-1, 'b']*0.5

Though this works, it does not scale well and takes a lot of time. It would be really helpful if someone could show how to do this more efficiently using pandas built-in functions.

Thanks!

Pawan

2 Answers


I don't think you can do this in a fully vectorised manner, as your loop relies on the df being modified in place whilst looping. You can, however, limit the operation to just the rows that contain NaN:

In [32]:
while df['b'].isnull().any():
    df.loc[df['b'].isnull(), 'b'] = df['b'].shift() / 2
df

Out[32]:
           a     b
0 2013-01-01  1.00
1 2013-01-02  0.50
2 2013-01-03  0.25
3 2013-01-04  4.00
4 2013-01-05  2.00
5 2013-01-06  7.00
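
Note that the `/2` above hard-codes the question's multiplier of 0.5; a minimal sketch of the same loop with the multiplier pulled out into a variable (`multiplier` is just an assumed name):

multiplier = 0.5
while df['b'].isnull().any():
    # each pass fills the NaNs that directly follow an already-filled value
    df.loc[df['b'].isnull(), 'b'] = df['b'].shift() * multiplier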

timings

In [36]:
%%timeit
for i in range(df.shape[0]):
    if np.isnan(df.loc[i, 'b']):
        df.loc[i, 'b'] = df.loc[i-1, 'b']*0.5
1000 loops, best of 3: 1.27 ms per loop

In [40]:
%%timeit
while df['b'].isnull().any():
    df.loc[df['b'].isnull(), 'b'] = df['b'].shift() / 2

10000 loops, best of 3: 91 µs per loop
EdChum
  • Just checked, this one is quite fast even for a loopy solution. Nice! With my NumPy-based one, the bottleneck seems to be the last step, where it updates the dataframe column with the NumPy-produced result. – Divakar Feb 15 '16 at 13:48

There is a dependency here that can be traced with a combination of diff + cumprod for a vectorized solution. The philosophy is very similar to a diff + cumsum based solution to a MATLAB problem discussed in great detail here; the only change is that we are dealing with multiplication instead of summation. The implementation would look something like this:

# Input param
multiplier = 0.5

# Store reciprocal of multiplier
reci_multiplier = 1.0/multiplier

# Store the relevant column as a numpy array for further computations
df_arr = np.array(df['b'])

# Get nan mask for the input column
nan_mask = np.isnan(df_arr)

# Indices where the "shifts" occur (basically the non-NaN positions)
idx = np.where(~nan_mask)[0]

# Set nan positions with multiplier 
df_arr[nan_mask] = multiplier

# Scale non-NaN positions with "position-based multipliers", which, when
# "cumprod-ed", result in our desired output. Update the column in df.
df_arr[idx[1:]] /= df_arr[idx[:-1]]*(reci_multiplier**(-(np.diff(idx)-1)))
df['b'] = df_arr.cumprod()

Sample input and output:

Input dataframe: 
            a   b
0  2013-01-01   5
1  2013-01-02 NaN
2  2013-01-03 NaN
3  2013-01-04   4
4  2013-01-05 NaN
5  2013-01-06 NaN
6  2013-01-07 NaN
7  2013-01-08 NaN
8  2013-01-09   8
9  2013-01-10   3
10 2013-01-11 NaN
11 2013-01-12  11

multiplier = 0.4   # Note that this is different from the original 0.5
                   # to show some variety with the application of solution 
Output dataframe: 
            a        b
0  2013-01-01   5.0000
1  2013-01-02   2.0000
2  2013-01-03   0.8000
3  2013-01-04   4.0000
4  2013-01-05   1.6000
5  2013-01-06   0.6400
6  2013-01-07   0.2560
7  2013-01-08   0.1024
8  2013-01-09   8.0000
9  2013-01-10   3.0000
10 2013-01-11   1.2000
11 2013-01-12  11.0000
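
For reuse, the steps above can be wrapped into a small helper. This is only a sketch (the function name `fill_nan_with_multiplier` is made up here, and it relies on the `np`/`pd` imports from the question), followed by how it would be called:

def fill_nan_with_multiplier(s, multiplier):
    # Vectorized fill: every NaN becomes multiplier * the previous (filled) value,
    # using the same diff + cumprod idea as above.
    # Note: like the snippet above, this assumes s starts with a non-NaN value.
    arr = np.array(s, dtype=float)
    nan_mask = np.isnan(arr)
    idx = np.where(~nan_mask)[0]   # non-NaN positions
    arr[nan_mask] = multiplier     # each NaN slot contributes one multiplier factor
    # Rescale the non-NaN positions so that a single cumprod restores them
    arr[idx[1:]] /= arr[idx[:-1]] * multiplier**(np.diff(idx) - 1)
    return pd.Series(arr.cumprod(), index=s.index)

df['b'] = fill_nan_with_multiplier(df['b'], 0.5)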
Divakar
  • Should the 3rd row be 1.000 and the 5th row be 2.000? – EdChum Feb 15 '16 at 10:46
  • @EdChum Well in my sample input, output I am using a multiplier of `0.4` just to show some variety there. – Divakar Feb 15 '16 at 10:47
  • I think this is the more correct solution; mine just loops until the condition is met, so it looks cleaner, but I'm not sure it will scale well if you have a lot of sequential `NaN`s +1 – EdChum Feb 15 '16 at 13:49
  • @EdChum Well, if there are a number of sequential NaNs, that means there are fewer elements in `idx`, since `idx = np.where(~nan_mask)[0]`. The next step uses `idx` to set elements, so that means less indexing, and all the other steps aren't affected by the number of `NaN`s. So I would think the performance would actually be better in that case. – Divakar Feb 15 '16 at 14:01
  • I was talking about my solution, not yours; in mine `df['b'].shift() / 2` will only fill a single `NaN` row at a time, so if there are more sequential `NaN`s it has to loop more – EdChum Feb 15 '16 at 14:02
  • @EdChum Ah yes, that's true I guess. Regarding my solution, if I make a new column in the dataframe, like `df['c'] = df_arr.cumprod()`, instead of updating the existing column `df['b']`, surprisingly I see a good performance boost. Do you happen to know if this boost is expected? I don't have much experience with pandas. – Divakar Feb 15 '16 at 14:06
  • Not sure; it could be that it's less work to assign a NumPy memory block to a new column than to overwrite an existing column, where it has to delete, allocate a new block and copy the data, but I'm guessing here – EdChum Feb 15 '16 at 14:11