I have a series that is categorical.

At the moment I convert it to strings using the following code:

import pandas as pd
import numpy as np
test = np.random.rand(int(5e6))
test[0] = np.nan
test_cut = pd.cut(test, (-np.inf, 0.2, 0.4, np.inf))
test_str = test_cut.astype('str')
# take the mask from the categorical, which still knows where the NaNs are
test_str[test_cut.isna()] = 'missing'

This astype('str') operation is very slow. Is there a way to speed it up?

Based on the link below, I understand that apply is faster than astype. I tried the following.

test_str = test_cut.apply(str)    
# AttributeError: 'Categorical' object has no attribute 'apply'

test_str = test_cut.map(str)   
# still categorical type

test_str = test_cut.values.astype(str)  
# AttributeError: 'Categorical' object has no attribute 'values'

Converting a series of ints to strings - Why is apply much faster than astype?

I do not care about the exact string representations of the categories, only that the groups are preserved and converted to strings.

As an alternative, is there a way to add a new category, 'MISSING' (or something else), to the test_cut categorical and assign the missing cases in test to it?

# some code to create 'MISSING' category
test_cut[test_cut.isna()] = 'MISSING'
oli5679

1 Answer

Use the labels parameter to generate strings instead of pd.Interval objects:

breaks = [-np.inf, 0.2, 0.4, np.inf]
test_cut = pd.cut(test, breaks, labels=pd.IntervalIndex.from_breaks(breaks).astype(str))

Try timing this version against the astype('str') approach.
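
For the second part of the question (giving the NaN cases their own category), here is a minimal sketch that combines the labels approach with add_categories. The helper name cut_to_labelled_categorical and the 'missing' label are made up for illustration, and the sample is kept small; I have not benchmarked it on the 5e6-element array.

```python
import numpy as np
import pandas as pd


def cut_to_labelled_categorical(values, breaks, missing_label="missing"):
    """Bin values under string labels and give NaNs their own category.

    Hypothetical helper for illustration; not part of pandas.
    """
    # One plain-string label per interval, built from the same breaks.
    labels = [str(interval) for interval in pd.IntervalIndex.from_breaks(breaks)]
    binned = pd.cut(values, breaks, labels=labels)
    # A value can only be assigned once it exists as a category.
    binned = binned.add_categories(missing_label)
    binned[binned.isna()] = missing_label
    return binned


rng = np.random.default_rng(0)
sample = rng.random(1_000)
sample[0] = np.nan

result = cut_to_labelled_categorical(sample, [-np.inf, 0.2, 0.4, np.inf])
# Only needed if a plain array of Python strings is required downstream.
as_strings = np.asarray(result)
```

The categories are already strings here, so the per-element conversion is avoided and only the handful of category labels is ever formatted. Wrapping both versions in timeit on your real data should tell you whether it helps on your pandas version.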

Scott Boston