
I often have to convert even continuous data into a categorical datatype, since it helps my statistical analysis.

When I apply boolean indexing (values < 11) to categorical columns, they are not sliced as expected:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

### MAKE TESTDATA
df = sns.load_dataset("fmri")

df["timepoint"] = pd.Categorical(df["timepoint"], ordered=True)

### PERFORM BOOLEAN SLICING
df = df.loc[df["timepoint"] < 11]
# df = df.where(df["timepoint"] < 11)  # SAME RESULT

g = sns.catplot(data=df, y="signal", x="timepoint")

This yields an incorrect plot: the x-axis still extends past the cutoff, up to 18, even though the data points at those timepoints were correctly sliced away.

Cause:

The rows of the categorical column were sliced, BUT its "categories" ignored the slicing operation and still list every original value. The plot seems to use the categories, not the remaining values, to build the x-axis.

>>> print(df.timepoint.cat.categories)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], dtype='int64')

What would make it work:

Performing the slicing BEFORE converting to categorical leads to the desired behavior. So does converting the categorical type back to numerical and then to categorical again. However, I doubt that this is the way it is intended.
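
A minimal sketch of the first workaround (slice first, convert afterwards). The small DataFrame is a hypothetical stand-in for the fmri dataset, so the snippet runs without seaborn, and the name `early` is only illustrative:

```python
import pandas as pd

# Hypothetical stand-in for the fmri data: timepoints 0..18, dummy signal.
df = pd.DataFrame({"timepoint": list(range(19)), "signal": [0.0] * 19})

# Slice while the column is still numeric ...
early = df.loc[df["timepoint"] < 11].copy()

# ... then convert, so only the surviving values become categories.
early["timepoint"] = pd.Categorical(early["timepoint"], ordered=True)

print(list(early["timepoint"].cat.categories))
```

Here the categories should only contain the values that survived the slice, because they are derived from the already filtered column.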

Question:

Is there an elegant way to slice by categorical column that removes "unused" categories (without changing datatypes back and forth)?

markur

1 Answer


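Pandas intentionally keeps "unused" categories after slicing. One can drop them with cat.remove_unused_categories(), which returns a new Series that has to be assigned back: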

df["timepoint"] = df["timepoint"].cat.remove_unused_categories()
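
To see the effect in isolation, here is a small sketch that compares the categories before and after the call. It assumes a hypothetical toy DataFrame in place of the fmri dataset; `sliced`, `n_before` and `n_after` are illustrative names only:

```python
import pandas as pd

# Hypothetical stand-in for the fmri data: timepoints 0..18, dummy signal.
df = pd.DataFrame({"timepoint": list(range(19)), "signal": [0.0] * 19})
df["timepoint"] = pd.Categorical(df["timepoint"], ordered=True)

# Boolean slicing drops the rows but keeps all original categories.
sliced = df.loc[df["timepoint"] < 11].copy()
n_before = len(sliced["timepoint"].cat.categories)

# Drop the categories that no longer occur in the data.
sliced["timepoint"] = sliced["timepoint"].cat.remove_unused_categories()
n_after = len(sliced["timepoint"].cat.categories)

print(n_before, n_after)
```

The column stays an ordered categorical, so no round trip through a numeric dtype is needed before plotting.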
markur