0

I have missing values (in fact pandas.NA). The problem is that they are not shown when using pandas.crosstab(). I can offer a workaround and would like to know if this is OK this way or if there is a better way.

This is not a duplicate of Missing data in pandas.crosstab but maybe related. There are also some maybe related bugreports around. But I am not sure if they discuss the same problem because I found nowhere a <NA> value in the examples there. Maybe someone can give estimate if my problem is related to the bugs or not.

The raw data...

   YEAR  FOO
0  2001    B
1  2001    B
2  2002    A
3  2000  NaN

...resulting in this crosstab with missing `NA:

YEAR  2000  2001  2002  All
FOO
A        0     0     1    1
B        0     2     0    2
All      1     2     1    4

I would expect

YEAR  2000  2001  2002  All
FOO
A        0     0     1    1
B        0     2     0    2
<NA>     1     0     0    1
All      1     2     1    4

This is the MWE producing the problem.

#!/usr/bin/env python3
import pandas as pd

data = {'YEAR': [2001, 2001, 2002, 2000],
        'FOO': ['B', 'B', 'A', pd.NA]}

df = pd.DataFrame(data)
df.FOO = df.FOO.astype('category')
print(df)
tab = pd.crosstab(df.FOO, df.YEAR, margins=True, dropna=False)
print(tab)

My workaround is to treat the missing values as normal values and doing some conversions.

# Workaround
df.FOO = df.FOO.astype('string')
df.FOO = df.FOO.fillna('')  # you could "use" missing also
df.FOO = df.FOO.astype('category')
print(df)

tab = pd.crosstab(df.FOO, df.YEAR, margins=True, dropna=False)
print(tab)

Gives the output:

   YEAR FOO
0  2001   B
1  2001   B
2  2002   A
3  2000

YEAR  2000  2001  2002  All
FOO
         1     0     0    1
A        0     0     1    1
B        0     2     0    2
All      1     2     1    4

EDIT: The questions comes up why I use the column as category. The answer is that while do read_csv() I treat the column via dtypes= as category by default. This is important because of performance and memory usage when using CSV files with millions of lines. So inittialy I always have a category column.

buhtz
  • 10,774
  • 18
  • 76
  • 149
  • 2
    looks more like a bug. Probably should mention it on pandas' github issues – sammywemmy Jul 14 '21 at 11:38
  • Another workaround could be through `value_counts(dropna=False)` and then adding margins manually but not sure if it's better than your current workaround. – Mustafa Aydın Jul 14 '21 at 11:47

3 Answers3

1

One option would be to add_categories to FOO then fillna with the new added NaN representation:

df['FOO'] = df['FOO'].cat.add_categories(['<NA>']).fillna('<NA>')
tab = pd.crosstab(df['FOO'], df['YEAR'], margins=True)

tab:

YEAR  2000  2001  2002  All
FOO                        
A        0     0     1    1
B        0     2     0    2
<NA>     1     0     0    1
All      1     2     1    4
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
0

Why you are converting it into category.

import pandas as pd
data = {'YEAR': [2001, 2001, 2002, 2000],
        'FOO': ['B', 'B', 'A', pd.NA]}
df = pd.DataFrame(data)
print(df)
df["FOO"].fillna("NA",inplace=True)
tab=pd.crosstab(df.FOO,df.YEAR,margins=True)
print(tab)
  • 1
    `fillna` would work even with categorical data. You'd just have to add the category. This is also the same as OP's listed "workaround" no? – Henry Ecker Jul 14 '21 at 11:49
  • @HenryEcker Not exactly the same as I did because I converted the column to `str` which would cause huge memory usage. How would I add a category `"(missing)"` to that column? – buhtz Jul 14 '21 at 11:53
0

I would simply replace the NAs with strings using fillna

tab = pd.crosstab(df['FOO'].fillna('NA'), df['YEAR'], margins=True)

tab:

YEAR  2000  2001  2002  All
FOO                        
A        0     0     1    1
B        0     2     0    2
NA       1     0     0    1
All      1     2     1    4
Stu
  • 1,543
  • 3
  • 17
  • 31