I have a pandas Series with repeated string values that I need to normalise into integer values to feed into TensorFlow.

I have looked at converting it to a category dtype, as per this, but it seems to create a code per item rather than identifying duplicates.

For example, I would like the following conversion:

['a', 'b', 'c', 'd', 'a', 'a', 'c'] -> [1, 2, 3, 4, 1, 1, 3]
clicky

2 Answers

You need factorize with a small change: its codes start at 0, so add 1:

import pandas as pd

print((pd.factorize(['a', 'b', 'c', 'd', 'a', 'a', 'c'])[0] + 1).tolist())
[1, 2, 3, 4, 1, 1, 3]
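
The same idea applied to a Series rather than a list can be sketched as follows. The names `s`, `codes`, `uniques`, `encoded`, and `decoded` are illustrative and not from the question. `pd.factorize` returns both the integer codes and the unique values, so the encoding can be reversed later if needed.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'a', 'a', 'c'])

# factorize returns (codes, uniques); codes are 0-based, in order of first appearance
codes, uniques = pd.factorize(s)

# Shift to 1-based labels and keep the original index
encoded = pd.Series(codes + 1, index=s.index)
print(encoded.tolist())  # [1, 2, 3, 4, 1, 1, 3]

# Map the 0-based codes back to the original strings
decoded = uniques[codes].tolist()
```

Keeping `uniques` around is useful when the integer labels have to be translated back into strings after the model has run.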
jezrael

You need to add .cat.codes after converting to category:

pd.Series(['a', 'b', 'c', 'd', 'a', 'a', 'c']).astype('category').cat.codes + 1
Out[1407]: 
0    1
1    2
2    3
3    4
4    1
5    1
6    3
dtype: int8
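
The two approaches give the same result for the example in the question only because the values happen to appear in sorted order. As a rough sketch with a hypothetical input (`s`, `by_appearance`, and `by_sorted` are illustrative names): `pd.factorize` numbers values in order of first appearance, whereas `.cat.codes` numbers them by the sorted order of the categories.

```python
import pandas as pd

# Hypothetical input whose values do not first appear in sorted order
s = pd.Series(['c', 'a', 'b', 'a'])

# factorize: codes follow order of first appearance (c, a, b)
by_appearance = (pd.factorize(s)[0] + 1).tolist()

# category codes: codes follow the sorted categories (a, b, c)
by_sorted = (s.astype('category').cat.codes + 1).tolist()

print(by_appearance)  # [1, 2, 3, 2]
print(by_sorted)      # [3, 1, 2, 1]
```

Either numbering identifies duplicates correctly; which one to use depends on whether the labels should follow the data order or the alphabetical order. With default settings, both methods encode missing values as -1, which becomes 0 after adding 1.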
BENY