pd.NA is the new guy in town and is pandas' own null value. Many pandas dtypes are borrowed from NumPy, and np.nan comes with them.
Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.
The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).
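To make the "consistent missing indicator" idea concrete, here is a minimal sketch of how pd.NA behaves as a scalar. It assumes pandas 1.0 or later; the variable names are only illustrative.

```python
import pandas as pd

# pd.NA is a singleton: every missing value is the same object.
assert pd.NA is pd.NA

# Comparisons propagate the missing value instead of returning True/False.
eq_result = pd.NA == 1

# Logical operations follow three-valued (Kleene) logic.
or_result = pd.NA | True    # True, whatever the missing value might have been
and_result = pd.NA & False  # False, whatever the missing value might have been

# Use pd.isna() to test for missing values, not ==.
print(eq_result is pd.NA, or_result, and_result, pd.isna(pd.NA))
```

Because `pd.NA == 1` is itself pd.NA, using it directly in an `if` raises a TypeError, which is why pd.isna() is the safer check.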
Let's build a DataFrame with all the different dtypes.
import numpy as np
import pandas as pd

# Build each column with a traditional (non-nullable) dtype and one missing value
d = {'int': pd.Series([1, None], dtype=np.dtype("O")),
     'float': pd.Series([3.0, np.nan], dtype=np.dtype("float")),
     'str': pd.Series(['test', None], dtype=np.dtype("str")),
     "bool": pd.Series([True, np.nan], dtype=np.dtype("O")),
     "date": pd.Series(['1/1/2000', np.nan], dtype=np.dtype("O"))}
df1 = pd.DataFrame(data=d)
# Parse the date strings; the missing entry becomes NaT
df1['date'] = pd.to_datetime(df1['date'], errors='coerce')
df1.info()
df1
output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   int     1 non-null      object
 1   float   1 non-null      float64
 2   str     1 non-null      object
 3   bool    1 non-null      object
 4   date    1 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 208.0+ bytes
    int  float   str  bool       date
0     1    3.0  test  True 2000-01-01
1  None    NaN  None   NaN        NaT
If you have a DataFrame or Series using traditional types with missing data represented by np.nan, the convenience methods Series.convert_dtypes() and DataFrame.convert_dtypes() can convert the data to the newer dtypes for integers, strings and booleans and, from v1.2, floats. Pass convert_integer=False to keep a float column whose values are all whole numbers as Float64 instead of having it converted to Int64.
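# Convert the non-float columns to the nullable extension dtypes
df1[['int', 'str', 'bool', 'date']] = df1[['int', 'str', 'bool', 'date']].convert_dtypes()
# convert_integer=False keeps the whole-number floats as Float64 rather than Int64
df1['float'] = df1['float'].convert_dtypes(convert_integer=False)
df1.info()
df1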
output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   int     1 non-null      Int64
 1   float   1 non-null      Float64
 2   str     1 non-null      string
 3   bool    1 non-null      boolean
 4   date    1 non-null      datetime64[ns]
dtypes: Float64(1), Int64(1), boolean(1), datetime64[ns](1), string(1)
memory usage: 200.0 bytes
    int  float   str  bool       date
0     1    3.0  test  True 2000-01-01
1  <NA>   <NA>  <NA>  <NA>        NaT
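As a small side-by-side check of what convert_integer=False changes, the sketch below converts the same float Series twice. It assumes pandas 1.2 or later (where the Float64 dtype exists); `whole_floats` is just an illustrative name.

```python
import numpy as np
import pandas as pd

# A float column whose non-missing values are all whole numbers
whole_floats = pd.Series([3.0, np.nan])

# Default: whole-number floats are converted to the nullable integer dtype
as_default = whole_floats.convert_dtypes()

# convert_integer=False keeps the column as nullable floating point
as_float = whole_floats.convert_dtypes(convert_integer=False)

print(as_default.dtype, as_float.dtype)
```

In both cases the missing entry ends up as pd.NA rather than np.nan.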
Note the capital 'F' in Float64, which distinguishes it from np.float32 or np.float64. Also note string, which is the new pandas StringDtype (from pandas 1.0) and not str or object.
Likewise Int64 (pd.Int64Dtype, from pandas 0.24) is the nullable integer dtype, with a capital 'I', and is not np.int64.
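If you are creating the data yourself, you can also ask for the nullable dtypes up front by passing their string aliases instead of converting afterwards. This is a minimal sketch assuming pandas 1.2 or later (needed for "Float64"); the variable names are illustrative.

```python
import pandas as pd

# The capitalised / pandas-specific aliases select the nullable extension dtypes
ints = pd.Series([1, None], dtype="Int64")
floats = pd.Series([3.0, None], dtype="Float64")
strs = pd.Series(["test", None], dtype="string")
bools = pd.Series([True, None], dtype="boolean")

# Each column stores its missing entry as pd.NA
for s in (ints, floats, strs, bools):
    print(s.dtype, s.iloc[1] is pd.NA)
```

The lower-case "int64" and "float64" aliases would instead give the NumPy-backed dtypes, and "int64" cannot hold a missing value at all.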
For more on datatypes read here and here. This page has some good info on subtypes.
I am using pandas v1.2.4, so hopefully in time we will have a universal null value for all datatypes, which will warm our hearts.
Warning: this is new and experimental, so use it carefully for now.