This is somewhat of a broad topic, but I will try to pare it down to some specific questions.
Since starting to answer questions on SO, I have sometimes run into a silly error like this when making toy data:
In[0]:
import pandas as pd
df = pd.DataFrame({"values":[1,2,3,4,5,6,7,8,9]})
df[df < 5] = np.nan
Out[0]:
NameError: name 'np' is not defined
I'm so used to importing `numpy` alongside `pandas` that this doesn't usually happen in real code. However, it did make me wonder why `pandas` doesn't have its own value/object for representing null values.
I only recently realized that you can just use Python's built-in `None` instead in the same situation:
import pandas as pd
df = pd.DataFrame({"values":[1,2,3,4,5,6,7,8,9]})
df[df < 5] = None
This works as expected and doesn't produce an error. But the convention I have seen on SO is to use `np.nan`, and people usually seem to mean `np.nan` when discussing null values (which is perhaps why I hadn't realized `None` could be used, though maybe that was my own idiosyncrasy).
Briefly looking into this, I have now seen that `pandas` does have its own `pandas.NA` value (since 1.0.0), but I have never seen anyone use it in a post:
In[0]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'values': np.random.rand(20)})
df['above'] = df['values']
df['below'] = df['values']
# null out one copy with np.nan and the other with pd.NA
df.loc[df['values'] > 0.7, 'above'] = np.nan
df.loc[df['values'] < 0.3, 'below'] = pd.NA
# a string column nulled with all three markers
df['names'] = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a'] * 2
df.loc[df['names'] == 'a', 'names'] = pd.NA
df.loc[df['names'] == 'b', 'names'] = np.nan
df.loc[df['names'] == 'c', 'names'] = None
df
Out[0]:
values above below names
0 0.323531 0.323531 0.323531 <NA>
1 0.690383 0.690383 0.690383 NaN
2 0.692371 0.692371 0.692371 None
3 0.259712 0.259712 NaN <NA>
4 0.473505 0.473505 0.473505 NaN
5 0.907751 NaN 0.907751 None
6 0.642596 0.642596 0.642596 <NA>
7 0.229420 0.229420 NaN NaN
8 0.576324 0.576324 0.576324 None
9 0.823715 NaN 0.823715 <NA>
10 0.210176 0.210176 NaN <NA>
11 0.629563 0.629563 0.629563 NaN
12 0.481969 0.481969 0.481969 None
13 0.400318 0.400318 0.400318 <NA>
14 0.582735 0.582735 0.582735 NaN
15 0.743162 NaN 0.743162 None
16 0.134903 0.134903 NaN <NA>
17 0.386366 0.386366 0.386366 NaN
18 0.313160 0.313160 0.313160 None
19 0.695956 0.695956 0.695956 <NA>
So it seems that in a numeric column the distinction between these null values disappears (they all show up as NaN), but in the string column each one is stored and displayed differently (and perhaps the same holds for other data types?).
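To double-check that reading, I compared a float column and an object column directly (the variable names here are just my own, for illustration):

import pandas as pd
import numpy as np

# in a float column, every null marker collapses to NaN
s_float = pd.Series([1.0, 2.0, 3.0])
s_float[0] = np.nan
s_float[1] = None
s_float[2] = pd.NA
print(s_float.tolist())          # [nan, nan, nan]

# in an object column, each marker is stored as the object you assigned
s_obj = pd.Series(['a', 'b', 'c'], dtype=object)
s_obj[0] = np.nan
s_obj[1] = None
s_obj[2] = pd.NA
print([type(x) for x in s_obj])  # float, NoneType, NAType
print(s_obj.isna().tolist())     # [True, True, True]

If I am reading that right, the float column normalizes everything to NaN, while the object column just holds whatever object it was given, and isna() still counts all three as missing.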
My questions based on the above:
- Is it conventional to use `np.nan` (rather than `None`) to represent null values in `pandas`?
- Why did `pandas` not have its own null value for most of its lifetime (until last year)? What was the motivation for adding one?
- In cases where you can have multiple types of missing values in one `Series` or column, is there any difference between them? Why are they not represented identically (as with numerical data)? (One difference I stumbled on is shown below.)
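On that last question, one concrete difference I have already noticed: comparisons against `np.nan` simply return False, whereas comparisons against `pd.NA` propagate the missing value:

import numpy as np
import pandas as pd

print(np.nan == 1)       # False: NaN comparisons always evaluate to False
print(np.nan == np.nan)  # False: NaN is not even equal to itself
print(pd.NA == 1)        # <NA>: pd.NA propagates through comparisons
print(pd.NA == pd.NA)    # <NA>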
I fully anticipate that I may have a flawed interpretation of things and of the distinction between `pandas` and `numpy`, so please correct me.