I have a categorical variable in a series. I want to assign integer ids to each unique value and create a new series with the ids, effectively turning a string variable into an integer variable. What is the most compact/efficient way to do this?
2 Answers
You could use pandas.factorize:
In [31]: import pandas as pd
In [32]: s = pd.Series(['a','b','c'])
In [33]: labels, levels = pd.factorize(s)
In [35]: labels
Out[35]: array([0, 1, 2])
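To turn those labels into a new integer series, a minimal sketch could look like the following. The sample values and variable names (`ids`, `sorted_codes`, `codes_na`) are illustrative and not part of the original answer; `sort=True` is an optional argument of `pd.factorize` that assigns ids in sorted order of the unique values instead of order of first appearance.

```python
import pandas as pd

# Illustrative data with repeated values, to show that equal strings share an id.
s = pd.Series(["b", "a", "b", "c", "a"])

# factorize returns (integer codes, unique values); ids follow order of first appearance.
codes, uniques = pd.factorize(s)

# Wrap the codes in a Series that keeps the original index.
ids = pd.Series(codes, index=s.index)

# With sort=True the ids follow the sorted order of the unique values instead.
sorted_codes, sorted_uniques = pd.factorize(s, sort=True)

# Missing values are not given an id of their own; by default they are coded as -1.
codes_na, uniques_na = pd.factorize(pd.Series(["a", None, "a"]))

print(ids.tolist())
print(list(uniques))
print(sorted_codes.tolist())
print(codes_na.tolist())
```

The second return value is worth keeping around, since `uniques[codes]` maps the ids back to the original strings.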

unutbu
Note that from 0.15 (to be released in the coming weeks), there will be more integrated categorical support; see http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#whatsnew-0150-cat – joris Sep 21 '14 at 20:22
Example using the new pandas categorical type in pandas 0.15+ (http://pandas.pydata.org/pandas-docs/version/0.16.2/categorical.html):
In [553]: x = pd.Series(['a', 'a', 'a', 'b', 'b', 'c']).astype('category')
In [554]: x
Out[554]:
0    a
1    a
2    a
3    b
4    b
5    c
dtype: category
Categories (3, object): [a, b, c]
In [555]: x.cat.codes
Out[555]:
0    0
1    0
2    0
3    1
4    1
5    2
dtype: int8
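As a rough sketch of how the codes relate to the categories, the snippet below rebuilds the original strings from the integer ids. The names `codes`, `categories`, and `restored` are illustrative and not from the original answer, and the example assumes there are no missing values (which `cat.codes` would report as -1).

```python
import pandas as pd

# Same data as in the session above.
x = pd.Series(["a", "a", "a", "b", "b", "c"]).astype("category")

# Integer id of each row; a small integer dtype is chosen automatically.
codes = x.cat.codes

# The lookup table: position i holds the value whose id is i.
categories = x.cat.categories

# Map the ids back to the original values (valid only when no code is -1).
restored = pd.Series(categories.take(codes.to_numpy()), index=x.index)

print(codes.tolist())
print(list(categories))
print(restored.tolist())
```

Compared with `pd.factorize`, this route keeps the ids and the lookup table together in a single categorical series.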

Daniel Golden