
I'm very confused by the output of the pct_change function when data with NaN values are involved. The first several rows of output in the right column are correct - it gives the percentage change in decimal form of the cell to the left in Column A relative to the cell in Column A two rows prior. But as soon as it reaches the NaN values in Column A, the output of the pct_change function makes no sense.

For example:

Row 8: NaN is 50% greater than 2?

Row 9: NaN is 0% greater than 3? 

Row 11: 4 is 33% greater than NaN?

Row 12: 2 is 33% less than NaN?

Based on the above math, it seems like pct_change is assigning NaN a value of "3". Is that because pct_change effectively fills forward the last non-NaN value? Could someone please explain the logic here and why this happens?

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2,1,3,1,4,5,2,3,np.nan,np.nan,np.nan,4,2,1,0,4]})
x = 2
df['pctchg_A'] = df['A'].pct_change(periods = x)

print(df.to_string())

Here's the output:

[Image: printed DataFrame with columns A and pctchg_A]

Mayank Porwal
essoclub94

1 Answer


The behaviour is as expected. You need to carefully read the df.pct_change docs.

As per docs:

fill_method: str, default ‘pad’
How to handle NAs before computing percent changes.

Here, the pad method means that each NaN is forward-filled with the most recent preceding non-NaN value before the percentage change is computed.
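As a rough sketch of what that means, the same numbers can be reproduced by hand: forward-fill first, then divide each value by the one two rows earlier and subtract 1. The names s, padded and manual are only for illustration, and this mirrors the documented behaviour rather than the library's actual implementation.

```python
import numpy as np
import pandas as pd

# Same data as column A in the question.
s = pd.Series([2, 1, 3, 1, 4, 5, 2, 3, np.nan, np.nan, np.nan, 4, 2, 1, 0, 4])

# Step 1: forward-fill, so rows 8-10 take the last valid value (3.0).
padded = s.ffill()

# Step 2: change relative to the value two rows earlier.
manual = padded / padded.shift(2) - 1

# The rows the question asks about:
#   row 8:  3 / 2 - 1 =  0.5
#   row 9:  3 / 3 - 1 =  0.0
#   row 11: 4 / 3 - 1 =  0.333333
#   row 12: 2 / 3 - 1 = -0.333333
print(manual.iloc[[8, 9, 11, 12]].round(6).to_string())
```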

So, if you ffill (pad) the NaN values yourself, you will see exactly what is happening. Check this out:

In [3201]: df['padded_A'] = df['A'].ffill()  # same result as fillna(method='pad')

In [3203]: df['pctchg_A'] = df['A'].pct_change(periods = x)

In [3204]: df
Out[3204]: 
      A  padded_A  pctchg_A
0   2.0       2.0       NaN
1   1.0       1.0       NaN
2   3.0       3.0  0.500000
3   1.0       1.0  0.000000
4   4.0       4.0  0.333333
5   5.0       5.0  4.000000
6   2.0       2.0 -0.500000
7   3.0       3.0 -0.400000
8   NaN       3.0  0.500000
9   NaN       3.0  0.000000
10  NaN       3.0  0.000000
11  4.0       4.0  0.333333
12  2.0       2.0 -0.333333
13  1.0       1.0 -0.750000
14  0.0       0.0 -1.000000
15  4.0       4.0  3.000000

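Now you can compare padded_A with pctchg_A: each pctchg_A value is the padded_A value divided by the padded_A value two rows earlier, minus 1. For example, row 8 gives 3 / 2 - 1 = 0.5 and row 11 gives 4 / 3 - 1 = 0.333333, so it works as expected.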
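If you would rather get NaN whenever a NaN is involved in the calculation, one option is to pass fill_method=None, which skips the forward-fill step. A minimal sketch using the question's data; the column name pctchg_nofill is made up here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 3, 1, 4, 5, 2, 3, np.nan, np.nan, np.nan, 4, 2, 1, 0, 4]})

# fill_method=None disables padding, so any NaN in the numerator
# or the denominator propagates into the result.
df['pctchg_nofill'] = df['A'].pct_change(periods=2, fill_method=None)

print(df.to_string())
```

With this, rows 8 to 12 come out as NaN, because either the row itself or the row two positions earlier is NaN. As far as I'm aware, newer pandas releases deprecate the implicit padding, so passing fill_method=None explicitly also avoids relying on the default.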

Mayank Porwal
  • Thank you - this makes sense. I'm still learning and while I'm familiar with fillna, I didn't understand the meaning of "pad". In looking at the docs, I see that "pad" is the default method - how can I see the other options for fill_method? For example, I would prefer to have the result just be "NaN" whenever there's a "NaN" involved in the calc from Column A. – essoclub94 Nov 23 '20 at 14:36
  • Check [`this`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) out for `fillna`. – Mayank Porwal Nov 24 '20 at 04:36