I have a DataFrame, dfAB:

import numpy as np
import pandas as pd
import random

A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]

dfAB = pd.DataFrame({ 'A': A, 'B': B })
dfAB

I want to know the 75th percentile of each column, so I call the quantile function:

dfAB.quantile(0.75)

But say I now put some NaNs in dfAB and re-run the function. The result is obviously different:

dfAB.loc[5:8]=np.nan
dfAB.quantile(0.75)

When I calculated the mean of dfAB, I passed skipna to ignore the NaNs, because I didn't want them affecting my stats. (I have quite a few in my data, on purpose, and replacing them with zero obviously doesn't help.)

dfAB.mean(skipna=True)
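
For reference, here is a minimal sketch of what skipna changes for mean. The `toy` frame and its values are made up for illustration and are not part of the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: three real values and one NaN.
toy = pd.DataFrame({"A": [1.0, 2.0, np.nan, 3.0]})

with_skip = toy["A"].mean(skipna=True)      # averages 1, 2 and 3 only
without_skip = toy["A"].mean(skipna=False)  # the NaN propagates into the result
default = toy["A"].mean()                   # skipna defaults to True

print(with_skip, without_skip, default)
```

So for mean, skipping NaNs is already the default, and the NaN only shows up in the result when skipna=False is passed explicitly.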

So my question is: does the quantile function handle NaNs, and if so, how?

Chong Onn Keat
Junaid Mohammad
  • Well, if you pass skipna=True, I guess it skips them. – Martino Sep 04 '18 at 17:31
  • If you pass skipna=False to mean and the data has a NaN, it will return NaN – BENY Sep 04 '18 at 17:34
  • Don't ask us; we're biological units. Try it and see what happens. Load a df with half `NaN` values and play around for a few minutes. – Prune Sep 04 '18 at 17:34
  • side comment on the way you generate A, B. you can just A = np.random.randint(100, size=10) – Trenton McKinney Sep 04 '18 at 17:40
  • The docs didn't have a reference to skipna for the quantile function, which is why I asked: DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear'). @sacul kindly pointed out the right function to compare against, np.nanpercentile, which I didn't know existed. Thanks all – Junaid Mohammad Sep 04 '18 at 17:49

2 Answers

Yes, DataFrame.quantile appears to ignore NaN values. To illustrate, you can compare its results to np.nanpercentile, which, quoting the docs, will "Compute the qth percentile of the data along the specified axis, while ignoring nan values":

>>> dfAB
      A     B
0   5.0  10.0
1  43.0  67.0
2  86.0   2.0
3  61.0  83.0
4   2.0  27.0
5   NaN   NaN
6   NaN   NaN
7   NaN   NaN
8   NaN   NaN
9  27.0  70.0

>>> dfAB.quantile(0.75)
A    56.50
B    69.25
Name: 0.75, dtype: float64

>>> np.nanpercentile(dfAB, 75, axis=0)
array([56.5 , 69.25])

As you can see, the two results are equivalent.
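
If you would rather check this programmatically than by eye, here is a minimal sketch. It hard-codes the values from the dfAB printed above (your random data will differ), and the `via_*` names are just illustrative:

```python
import numpy as np
import pandas as pd

# Values copied from the dfAB printed above; rows 5-8 are NaN.
nan = np.nan
dfAB = pd.DataFrame({
    "A": [5, 43, 86, 61, 2, nan, nan, nan, nan, 27],
    "B": [10, 67, 2, 83, 27, nan, nan, nan, nan, 70],
})

via_pandas = dfAB.quantile(0.75).to_numpy()
via_numpy = np.nanpercentile(dfAB.to_numpy(), 75, axis=0)

# Dropping the NaNs by hand, column by column, before taking the quantile.
via_dropna = np.array([dfAB[col].dropna().quantile(0.75) for col in dfAB.columns])

print(np.allclose(via_pandas, via_numpy))   # pandas vs. NaN-aware NumPy
print(np.allclose(via_pandas, via_dropna))  # pandas vs. manual dropna
```

The third computation is there because it states the behaviour most directly: each column's quantile is taken over its non-NaN entries only.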

sacuL
  • For Pandas v2.0 and up the default for numeric_only is False. See [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html). I expect this will change the output of the answer here. – Frank_Coumans Jul 24 '23 at 15:38

Yes. DataFrame.quantile() ignores NaN values when calculating the quantile.

To show this, we can compare it with np.nanquantile, which computes the qth quantile of the data along the specified axis while ignoring NaN values.

>>> random.seed(7)
>>> A = [ random.randint(0,100) for i in range(10) ]
>>> B = [ random.randint(0,100) for i in range(10) ]
>>> dfAB = pd.DataFrame({'A': A, 'B': B})
>>> dfAB.loc[5:8]=np.nan

>>> dfAB
      A     B
0  41.0   7.0
1  19.0  64.0
2  50.0  27.0
3  83.0   4.0
4   6.0  11.0
5   NaN   NaN
6   NaN   NaN
7   NaN   NaN
8   NaN   NaN
9  74.0  11.0

>>> dfAB.quantile(0.75)
A    68.0
B    23.0
Name: 0.75, dtype: float64

>>> np.nanquantile(dfAB, 0.75, axis=0)
array([68., 23.])
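
For contrast, here is a rough sketch of what the non-NaN-aware np.quantile does with the same data. The values are hard-coded from the table above so the result doesn't depend on the random seed, and the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

# Values copied from the dfAB printed above; rows 5-8 are NaN.
nan = np.nan
dfAB = pd.DataFrame({
    "A": [41, 19, 50, 83, 6, nan, nan, nan, nan, 74],
    "B": [7, 64, 27, 4, 11, nan, nan, nan, nan, 11],
})

values = dfAB.to_numpy()
plain = np.quantile(values, 0.75, axis=0)         # not NaN-aware: NaN propagates
nan_aware = np.nanquantile(values, 0.75, axis=0)  # skips NaN
pandas_result = dfAB.quantile(0.75).to_numpy()

print(np.isnan(plain).all())                   # both columns come back as NaN
print(np.allclose(pandas_result, nan_aware))   # pandas matches the NaN-aware version
```

So the pandas result lines up with np.nanquantile rather than np.quantile, which is what you would expect if the NaNs are being skipped.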
Chong Onn Keat