
pd.NA vs np.nan for pandas: which one should I use with pandas, and why? What are the main advantages and disadvantages of each of them with pandas?

Some sample code that uses them both:

import pandas as pd
import numpy as np

df = pd.DataFrame({'object': ['a', 'b', 'c', pd.NA],
                   'numeric': [1, 2, np.nan, 4],
                   'categorical': pd.Categorical(['d', np.nan, 'f', 'g'])})

output:

|    | object   |   numeric | categorical   |
|---:|:---------|----------:|:--------------|
|  0 | a        |         1 | d             |
|  1 | b        |         2 | nan           |
|  2 | c        |       nan | f             |
|  3 | <NA>     |         4 | g             |
vasili111
  • I'm pretty sure pd.NA uses np.nan in the back end; pandas tends to use NumPy in the back end a lot – Kenan Feb 07 '20 at 14:53
  • What version of pandas is this? – roganjosh Feb 07 '20 at 14:54
  • @roganjosh I am using v1.0.0 from Anaconda. – vasili111 Feb 07 '20 at 14:55
  • 7
    "*Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations*" from [here](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-new-features) – anky Feb 07 '20 at 14:55
  • 2
    @kenan no, in this case, it is [distinct](https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#experimental-new-features) – roganjosh Feb 07 '20 at 14:56
  • 1
    @roganjosh ahh i see it's a pandas 1.0 function, thank you for clearing that up for me – Kenan Feb 07 '20 at 14:59
  • @kenan no worries. It's quite a big feature that's only recently come about. I'm trying to see if there is a canonical for it already – roganjosh Feb 07 '20 at 15:02

6 Answers


As of now (the pandas 1.0.0 release), I would really recommend using it carefully.

First, it's still an experimental feature:

Experimental: the behaviour of pd.NA can still change without warning.

Second, the behaviour differs from np.nan:

Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations.

Both quotes are from the release notes.
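A small sketch of that comparison behaviour: unlike np.nan, pd.NA propagates through comparisons and follows three-valued (Kleene) logic in boolean operations, per the pandas docs.

```python
import pandas as pd
import numpy as np

# pd.NA propagates through comparisons; np.nan simply compares unequal
print(pd.NA > 1)      # <NA>
print(np.nan > 1)     # False

# pd.NA follows three-valued logic in boolean operations
print(pd.NA | True)   # True  (True regardless of the unknown value)
print(pd.NA & False)  # False (False regardless of the unknown value)
```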

To show an additional example, I was surprised by the interpolation behaviour:

Create a simple DataFrame:

df = pd.DataFrame({"a": [0, pd.NA, 2], "b": [0, np.nan, 2]})
df
#       a    b
# 0     0  0.0
# 1  <NA>  NaN
# 2     2  2.0

and try to interpolate:

df.interpolate()
#       a    b
# 0     0  0.0
# 1  <NA>  1.0
# 2     2  2.0

There are reasons for that (I am still exploring them); anyway, I just want to highlight those differences: it is an experimental feature, and it behaves differently in some cases.

I think it will be a very useful feature, but I would be really careful with statements like "it should be completely fine to use it instead of np.nan". That might be true in most cases, but it can cause trouble when you are not aware of the differences.

Nerxis
  • 1
    Is this still considered experimental? – Ben Jones Sep 24 '22 at 06:08
  • 5
    @BenJones yes, in the latest version (1.5) it's still considered experimental: https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values – Nerxis Sep 26 '22 at 08:19
  • 4
    `pd.NA` can often be very surprising. I used it to indicate missing values recently in lieu of `np.nan`, but the type caused other libraries to capriciously break. Notably, the library (Samplics) used `np.isfinite` as well as functions from `np.linalg` which both threw errors about the shape and type of the data. The errors were very confusing. I'm on version 1.5.3 of pandas and 1.24.2 for NumPy - so if anyone is wondering about the state of `pd.NA` in 2023, make sure you heed the warning about its experimental status for now even if it tends to work fine. – Joshua Megnauth Apr 01 '23 at 22:39

According to the docs

The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types

So if you have a column with multiple dtypes, use pd.NA; otherwise np.nan should be fine.

However, since pd.NA seems to offer much the same functionality as np.nan, it might just be better to use pd.NA for all your NaN purposes.

Note per comments:

pd.NA does not have exactly the same functionality, so be careful when switching. pd.NA propagates in equality operations and np.nan does not: pd.NA == 1 yields <NA>, but np.nan == 1 yields False.
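One practical consequence of the note above: neither marker should be tested with ==. A minimal sketch showing that pd.isna() recognizes both, while direct equality checks are unreliable:

```python
import pandas as pd
import numpy as np

# pd.isna() handles both missing-value markers uniformly
print(pd.isna(pd.NA))    # True
print(pd.isna(np.nan))   # True

# direct equality checks are NOT reliable for either marker
print(np.nan == np.nan)  # False (IEEE 754 NaN semantics)
print(pd.NA == pd.NA)    # <NA>  (the comparison itself propagates)
```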

Kenan
  • 1
    From https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html provided by @tdpr , it seems that `NA` is now experimental feature, so for something serious I think it should be avoided for now. – vasili111 Feb 14 '20 at 20:19
  • 8
    `pd.NA` does not have exactly the same functionality, so be careful when switching. `pd.NA` propagates in equality operations and `np.nan` does not. `pd.NA == 1` yields ``, but `np.nan == 1` yields `False`. – Steven Mar 07 '20 at 05:14

Both pd.NA and np.nan denote missing values in a DataFrame.

The main practical difference I have noticed is how they interact with dtypes: np.nan is a floating-point value, while pd.NA is a dtype-agnostic missing-value marker that works with pandas' nullable dtypes. If column1 contains integers and its missing values are replaced by np.nan, the dtype of the column becomes float, since np.nan is a float. But if column2 uses the nullable integer dtype (Int64) and its missing values are represented by pd.NA, the dtype of the column remains integer. This is useful if you want to keep a column as int rather than have it silently promoted to float.
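A minimal sketch of that dtype difference:

```python
import pandas as pd
import numpy as np

# np.nan is a float, so it forces an integer column to float64
s_nan = pd.Series([1, 2, np.nan])
print(s_nan.dtype)  # float64

# pd.NA works with the nullable Int64 dtype, keeping the integers intact
s_na = pd.Series([1, 2, pd.NA], dtype="Int64")
print(s_na.dtype)   # Int64
```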

SecretAgentMan

pd.NA is the new guy in town and is pandas' own null value. Many of pandas' datatypes are borrowed from NumPy, and that includes np.nan.

Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.

The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

Let's build a DataFrame with all the different dtypes.

d = {'int': pd.Series([1, None], dtype=np.dtype("O")),
     'float': pd.Series([3.0, np.nan], dtype=np.dtype("float")),
     'str': pd.Series(['test', None], dtype=np.dtype("str")),
     'bool': pd.Series([True, np.nan], dtype=np.dtype("O")),
     'date': pd.Series(['1/1/2000', np.nan], dtype=np.dtype("O"))}
df1 = pd.DataFrame(data=d)

df1['date'] = pd.to_datetime(df1['date'], errors='coerce')
df1.info()
df1

output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   int     1 non-null      object        
 1   float   1 non-null      float64       
 2   str     1 non-null      object        
 3   bool    1 non-null      object        
 4   date    1 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 208.0+ bytes
    int   float str     bool    date
0   1     3.0   test    True    2000-01-01
1   None  NaN   None    NaN     NaT

If you have a DataFrame or Series using traditional types with missing data represented by np.nan, there are convenience methods Series.convert_dtypes() and DataFrame.convert_dtypes() that convert the data to the newer dtypes for integers, strings and booleans, and (from v1.2) for floats, using convert_integer=False.

df1[['int', 'str', 'bool', 'date']] = df1[['int', 'str', 'bool', 'date']].convert_dtypes()
df1['float'] = df1['float'].convert_dtypes(convert_integer=False)
df1.info()
df1

output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   int     1 non-null      Int64         
 1   float   1 non-null      Float64       
 2   str     1 non-null      string        
 3   bool    1 non-null      boolean       
 4   date    1 non-null      datetime64[ns]
dtypes: Float64(1), Int64(1), boolean(1), datetime64[ns](1), string(1)
memory usage: 200.0 bytes
    int     float   str     bool    date
0   1       3.0     test    True    2000-01-01
1   <NA>    <NA>    <NA>    <NA>    NaT

Note the capital 'F' in Float64, which distinguishes it from np.float32 and np.float64. Likewise, string is the new pandas StringDtype (from pandas 1.0), not str or object, and Int64 (from pandas 0.24) is the nullable integer with a capital 'I', not np.int64.
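As a small sketch, the nullable dtypes can also be requested directly by their string aliases (Float64 requires pandas >= 1.2):

```python
import pandas as pd

s = pd.Series([1, None], dtype="Int64")      # nullable integer, capital 'I'
t = pd.Series([1.5, None], dtype="Float64")  # nullable float, capital 'F'
u = pd.Series(["a", None], dtype="string")   # pandas StringDtype, not object

print(s.dtype, t.dtype, u.dtype)  # Int64 Float64 string
```

In all three, the missing value displays as <NA>, i.e. pd.NA.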

For more on datatypes read here and here. This page has some good info on subtypes.

I am using pandas v1.2.4, so hopefully in time we will have a universal null value for all datatypes, which will warm our hearts.

Warning: this is new and experimental, so use it carefully for now.

Cam
  • Thanks, but to be honest `pd.NA` still confuses me a bit. Have you tried using this NA value directly, e.g. for your first column (int) like this: `'int': pd.Series([1, pd.NA], dtype=np.dtype("O"))` (`pd.NA` instead of `None`)? Because with that even after using `convert_dtypes` method the column type stays the same (`object` instead of `Int64`). – Nerxis Nov 02 '21 at 08:50
  • @Nerxis At this moment, pd.NA is used in the nullable integer, boolean and dedicated string data types only. There is a discussion on your point on using object here https://github.com/pandas-dev/pandas/issues/32931 – Cam Nov 02 '21 at 21:56
  • Yes, I understand this, but my point was that `convert_dtypes` does not convert this column of object type into `Int64`; this should be supported. But thanks for the link; they discuss this, including the behavior of `convert_dtypes`, whose docstring is a bit confusing (it differs from the real behavior). – Nerxis Nov 03 '21 at 07:40

pd.NA was introduced in the recent release of pandas-1.0.0.

I would recommend using it over np.nan: since it is part of the pandas library, it should work best with DataFrames.

tdpr
  • 1
    From your link it seems that `NA` is now experimental feature, so for something serious I think it should be avoided for now. – vasili111 Feb 14 '20 at 20:17
  • 1
    `pd.NA` does not have exactly the same functionality, so be careful when switching. `pd.NA` propagates in equality operations and `np.nan` does not. `pd.NA == 1` yields ``, but `np.nan == 1` yields `False`. – Steven Mar 07 '20 at 05:15

pd.NA is still experimental (https://pandas.pydata.org/docs/user_guide/missing_data.html) and can have undesired outcomes.

For example:

import pandas as pd
df = pd.DataFrame({'id':[1,2,3]})
df.id.replace(2, pd.NA, inplace=True)
df.id.replace(3, pd.NA, inplace=True)

Pandas 1.2.4:

id
0 1
1 <NA>
2 3

Pandas 1.4.2:

AttributeError: 'bool' object has no attribute 'to_numpy'

It appears that pd.NA changes the data frame in a way that the second replacement doesn't work anymore.

The same code with np.nan works without problems.

import pandas as pd
import numpy as np
df = pd.DataFrame({'id':[1,2,3]})
df.id.replace(2, np.nan, inplace=True)
df.id.replace(3, np.nan, inplace=True)
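If you still want pd.NA in that column, one workaround sketch (assuming the nullable Int64 dtype is acceptable) is to convert the column first and use mask() instead of chained inplace replace calls:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3]})

# convert to the nullable Int64 dtype first, then mask the values that
# should become missing -- no chained inplace replace involved
df['id'] = df['id'].astype("Int64")
df['id'] = df['id'].mask(df['id'].isin([2, 3]))

print(df['id'].tolist())  # [1, <NA>, <NA>]
```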
Benjamin Ziepert