6

This is somewhat of a broad topic, but I will try to pare it down to some specific questions.

In starting to answer questions on SO, I have found myself sometimes running into a silly error like this when making toy data:

In[0]:

import pandas as pd

df = pd.DataFrame({"values":[1,2,3,4,5,6,7,8,9]})
df[df < 5] = np.nan

Out[0]:
NameError: name 'np' is not defined

I'm so used to automatically importing numpy with pandas that this doesn't usually occur in real code. However, it did make me wonder why pandas doesn't have its own value/object for representing null values.

I only recently realized that you can just use Python's None instead in a similar situation:

import pandas as pd

df = pd.DataFrame({"values":[1,2,3,4,5,6,7,8,9]})
df[df < 5] = None

This works as expected and doesn't produce an error. But the convention I have seen on SO is to use np.nan, and people usually seem to mean np.nan when discussing null values (this is perhaps why I hadn't realized None can be used, but maybe that was my own idiosyncrasy).

Looking into this briefly, I see that pandas has had a pandas.NA value since 1.0.0, but I have never seen anyone use it in a post:

In[0]:

import pandas as pd
import numpy as np

df = pd.DataFrame({'values':np.random.rand(20,)})
df['above'] = df['values']
df['below'] = df['values']
# assign with .loc rather than chained indexing (avoids SettingWithCopyWarning)
df.loc[df['values']>0.7, 'above'] = np.nan
df.loc[df['values']<0.3, 'below'] = pd.NA

df['names'] = ['a','b','c','a','b','c','a','b','c','a']*2
df.loc[df['names']=='a','names'] = pd.NA
df.loc[df['names']=='b','names'] = np.nan
df.loc[df['names']=='c','names'] = None
df

Out[0]:
      values     above     below names
0   0.323531  0.323531  0.323531  <NA>
1   0.690383  0.690383  0.690383   NaN
2   0.692371  0.692371  0.692371  None
3   0.259712  0.259712       NaN  <NA>
4   0.473505  0.473505  0.473505   NaN
5   0.907751       NaN  0.907751  None
6   0.642596  0.642596  0.642596  <NA>
7   0.229420  0.229420       NaN   NaN
8   0.576324  0.576324  0.576324  None
9   0.823715       NaN  0.823715  <NA>
10  0.210176  0.210176       NaN  <NA>
11  0.629563  0.629563  0.629563   NaN
12  0.481969  0.481969  0.481969  None
13  0.400318  0.400318  0.400318  <NA>
14  0.582735  0.582735  0.582735   NaN
15  0.743162       NaN  0.743162  None
16  0.134903  0.134903       NaN  <NA>
17  0.386366  0.386366  0.386366   NaN
18  0.313160  0.313160  0.313160  None
19  0.695956  0.695956  0.695956  <NA>

So it seems that for numerical values, the distinction between these different null values doesn't matter, but they are represented differently for strings (and perhaps for other data types?).
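
One way to verify that reading directly (a minimal sketch; the coercion matches the output above, though exact rules can shift between pandas versions):

import pandas as pd
import numpy as np

# In a float64 Series, every missing marker is coerced to the same np.nan
num = pd.Series([0.1, 0.2, 0.3])
num[0], num[1], num[2] = np.nan, None, pd.NA
print(num)          # all three rows print as NaN, dtype stays float64

# In an object Series, the markers are stored as-is...
obj = pd.Series(['a', 'b', 'c'], dtype=object)
obj[0], obj[1], obj[2] = np.nan, None, pd.NA
print(obj)          # NaN, None, <NA>
# ...but all three still count as missing
print(obj.isna())   # True for every row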

My questions based on the above:

  • Is it conventional to use np.nan (rather than None) to represent null values in pandas?
  • Why did pandas not have its own null value for most of its lifetime (until last year)? What was the motivation for adding one?
  • In cases where you can have multiple types of missing values in one Series or column, is there any difference between them? Why are they not represented identically (as with numerical data)?

I fully anticipate that I may have a flawed interpretation of things and the distinction between pandas and numpy, so please correct me.

Tom
  • `import numpy as np` can be found throughout `pandas` code. They discourage using `pd.np`, but encourage you to do your own import. It won't take up any more 'memory'. – hpaulj Jun 20 '20 at 18:12
  • If the column/Series is numeric (integer), assigning any of these "NA" values makes it `float` and inserts `np.nan`. If object dtype (as with the strings column), the actual `np.nan`, `None` or `pd.NA` is inserted. – hpaulj Jun 20 '20 at 18:20
  • `np.nan` is an "IEEE 754 floating point" value, so it can be used efficiently in numeric operations (the fast compiled whole-array `numpy` code). So its use, by any alias, in a numeric-dtype Series makes a lot of sense. That doesn't apply to object-dtype Series, so any convenient object can be used there (see the sketch after these comments). – hpaulj Jun 20 '20 at 18:30
  • @hpaulj good input as well, this in combination with ALollz's answer is comprehensive – Tom Jun 20 '20 at 21:38
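
To make hpaulj's IEEE 754 point concrete, here is a quick numpy-only check (a minimal sketch):

import numpy as np

print(type(np.nan))                   # <class 'float'>: an IEEE 754 double
print(np.array([1.0, np.nan]).dtype)  # float64: nan fits in a compiled numeric array
print(np.array([1.0, None]).dtype)    # object: None forces boxed Python objects
print(np.isnan([1.0, np.nan]))        # [False  True]: fast vectorized nan checks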

3 Answers

7

A main dependency of pandas is numpy; in other words, pandas is built on top of numpy. Because pandas inherits and uses many of the numpy methods, it makes sense to keep things consistent: missing numeric data are represented with np.NaN.

(This choice to build upon numpy has consequences for other things too. For instance, date and time operations are built upon the np.timedelta64 and np.datetime64 dtypes, not the standard datetime module.)


One thing you may not have known is that numpy has always been there with pandas:

import pandas as pd
pd.np?
pd.np.nan

Though this might seem convenient since you don't have to import numpy yourself, it is discouraged, and in the near future it will be deprecated in favor of importing numpy directly:

FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead


Is it conventional to use np.nan (rather than None) to represent null values in pandas?

If the data are numeric then yes, you should use np.NaN. None requires the dtype to be Object, and with pandas you want numeric data stored in a numeric dtype. pandas will generally coerce to the proper null type upon creation or import so that it can use the correct dtype:

pd.Series([1, None])
#0    1.0
#1    NaN        <- None became NaN so it can have dtype: float64
#dtype: float64

Why did pandas not have its own null value for most of its lifetime (until last year)? What was the motivation for adding?

pandas did not have its own null value because it got by with np.NaN, which worked for the majority of circumstances. However, with pandas it's very common to have missing data; an entire section of the documentation is devoted to this. NaN, being a float, does not fit into an integer container, which means that any numeric Series with missing data is upcast to float. This can become problematic because of floating point math: some integers cannot be represented exactly by a floating point number, so any joins or merges could possibly fail.

# Gets upcast to float
pd.Series([1,2,np.NaN])
#0    1.0
#1    2.0
#2    NaN
#dtype: float64

# Can safely do merges/joins/math because things are still Int
pd.Series([1,2,np.NaN]).astype('Int64')
#0       1
#1       2
#2    <NA>
#dtype: Int64
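
To make the join/merge concern concrete, here is a hedged sketch with a hypothetical id column: once NaN forces an upcast to float64, distinct large ids can collide, while the nullable Int64 dtype keeps them exact.

import pandas as pd
import numpy as np

# Hypothetical id column: one missing entry upcasts everything to float64
ids = pd.Series([2**53, 2**53 + 1, np.nan])
print(ids.dtype)          # float64
print(ids[0] == ids[1])   # True: two distinct ids now compare equal

# The nullable Int64 dtype keeps the ids exact and the missing value intact
ids_int = pd.Series([2**53, 2**53 + 1, None], dtype='Int64')
print(ids_int[0] == ids_int[1])   # False: still distinct
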
ALollz
  • Nice answer, thank you! I didn't know about `pd.np`, you were right – Tom Jun 20 '20 at 21:36
  • Re "some integers cannot be represented perfectly by a floating point number" - a 64-bit float can represent integers exactly up to `2**53` (9e+15), compared to `2**63` for int64 (9e+18), so in the vast majority of cases you're not going to run into problems (quick check below). – Han-Kwang Nienhuys Jun 21 '20 at 11:04
1
  • Firstly, you can unify the nan values with a filter function that returns only one value, say None (see the sketch after this list).
  • I guess the reason is to keep pandas' value distinct, in case you are data-mining data that came from numpy calculations and so on. So the pandas NA can mean something different from numpy's nan. Maybe that distinction does not make sense in your special case, but it will have a meaning in other cases.
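
For the first bullet, a sketch of such a filter function (assuming an object column holding all three markers; `pd.isna` recognizes each of them):

import pandas as pd
import numpy as np

s = pd.Series(['a', np.nan, None, pd.NA], dtype=object)

# Map every missing marker to a single value, here None
unified = s.apply(lambda v: None if pd.isna(v) else v)
print(unified.tolist())   # ['a', None, None, None]
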
  • Thanks for your input! Yes you're right about your first point, and they all get caught by `.isna()` or `.isnull()` I believe – Tom Jun 20 '20 at 21:37
  • That is maybe one way to do it, but I would think of it in a different way, defining the filters as `filter1 = df[key==NaN]` `filter2 = df[key==None]` `filter3 = df[key=='']`. Then you can simply use `df.loc` and return a unique value as discussed above. – Eslam Ibrahim Jun 20 '20 at 23:07
1

That's a great question! My hunch is that this has to do with the fact that NumPy functions are implemented in C, which makes them so fast. Python's None might not give you the same efficiency (or is probably translated into np.nan), while pandas's pd.NA would likely be translated into NumPy's np.nan anyway, since pandas requires NumPy. I haven't found resources to support my claims yet, though.

Benji