I have missing values (in fact pandas.NA
). The problem is that they are not shown when using pandas.crosstab()
. I can offer a workaround and would like to know if this is OK this way or if there is a better way.
This is not a duplicate of Missing data in pandas.crosstab but maybe related. There are also some maybe related bugreports around. But I am not sure if they discuss the same problem because I found nowhere a <NA>
value in the examples there. Maybe someone can give estimate if my problem is related to the bugs or not.
The raw data...
YEAR FOO
0 2001 B
1 2001 B
2 2002 A
3 2000 NaN
...resulting in this crosstab with missing `NA:
YEAR 2000 2001 2002 All
FOO
A 0 0 1 1
B 0 2 0 2
All 1 2 1 4
I would expect
YEAR 2000 2001 2002 All
FOO
A 0 0 1 1
B 0 2 0 2
<NA> 1 0 0 1
All 1 2 1 4
This is the MWE producing the problem.
#!/usr/bin/env python3
import pandas as pd
data = {'YEAR': [2001, 2001, 2002, 2000],
'FOO': ['B', 'B', 'A', pd.NA]}
df = pd.DataFrame(data)
df.FOO = df.FOO.astype('category')
print(df)
tab = pd.crosstab(df.FOO, df.YEAR, margins=True, dropna=False)
print(tab)
My workaround is to treat the missing values as normal values and doing some conversions.
# Workaround
df.FOO = df.FOO.astype('string')
df.FOO = df.FOO.fillna('') # you could "use" missing also
df.FOO = df.FOO.astype('category')
print(df)
tab = pd.crosstab(df.FOO, df.YEAR, margins=True, dropna=False)
print(tab)
Gives the output:
YEAR FOO
0 2001 B
1 2001 B
2 2002 A
3 2000
YEAR 2000 2001 2002 All
FOO
1 0 0 1
A 0 0 1 1
B 0 2 0 2
All 1 2 1 4
EDIT: The questions comes up why I use the column as category
. The answer is that while do read_csv()
I treat the column via dtypes=
as category
by default. This is important because of performance and memory usage when using CSV files with millions of lines. So inittialy I always have a category
column.