I have a categorical (the result of pd.cut).
At the moment I am converting it to strings using the following code.
import pandas as pd
import numpy as np
test = np.random.rand(int(5e6))
test[0] = np.nan
test_cut = pd.cut(test,(-np.inf,0.2,0.4,np.inf))
test_str = test_cut.astype('str')
# take the NaN mask from test_cut: after the string conversion the result no longer holds NaN
test_str[test_cut.isna()] = 'missing'
This astype('str') operation is very slow. Is there a way to speed it up?
Based on the link below, I understand that apply is faster than astype. I tried the following.
test_str = test_cut.apply(str)
# AttributeError: 'Categorical' object has no attribute 'apply'
test_str = test_cut.map(str)
# still categorical type
test_str = test_cut.values.astype(str)
# AttributeError: 'Categorical' object has no attribute 'values'
The linked question: "Converting a series of ints to strings - Why is apply much faster than astype?"
I do not care about the exact string representations of the categories, only that the groups are preserved and converted to strings.
Alternatively, is there a way to add a new category such as 'Missing' to the test_cut categorical and assign the NaN entries of test to that category?
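To make "groups preserved" concrete, here is a minimal sketch of the kind of result I am after. It assumes the category labels can be stringified once and then looked up through the integer codes. The names values, cut, labels and as_str are only illustrative and are not part of my code above, and I have not benchmarked this on the full 5e6 array.

```python
import numpy as np
import pandas as pd

# Small illustrative input (hypothetical names, not the variables above).
values = np.random.rand(1000)
values[0] = np.nan
cut = pd.cut(values, (-np.inf, 0.2, 0.4, np.inf))

# Stringify only the few categories instead of every element,
# and append a label for missing values at the end.
labels = np.array([str(c) for c in cut.categories] + ["missing"], dtype=object)

# cut.codes is an integer array in which -1 marks NaN; indexing with -1
# picks the last entry of labels, i.e. "missing".
as_str = labels[cut.codes]
```

The design idea is that the expensive str() call runs once per category rather than once per element; whether this is actually faster than astype('str') would need to be measured.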
# some code to create 'MISSING' category
test_cut[test_cut.isna()] = 'MISSING'
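A possible sketch of this alternative, assuming Categorical.add_categories is an acceptable way to register the extra label before assigning it. The names values and cut are illustrative only, and I am not sure this is the idiomatic approach.

```python
import numpy as np
import pandas as pd

# Small illustrative input (hypothetical names, not the variables above).
values = np.random.rand(1000)
values[0] = np.nan
cut = pd.cut(values, (-np.inf, 0.2, 0.4, np.inf))

# Register the new label first; a value that is not already
# a category cannot be assigned to a Categorical.
cut = cut.add_categories("MISSING")

# Assign the new category wherever the original value was NaN.
cut[cut.isna()] = "MISSING"
```

Note that the result is still categorical, with interval categories plus one string category, so this addresses the missing-value part but not the conversion to strings.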