1

I have a Pandas Series like:

0    bar
1    foo
2    bar
3    bar
4    bar
5    foo

I would like to map this Series to another Series based on a numpy array specifying the order, [bar, foo]. Then the result should be:

0    0
1    1
2    0
3    0
4    0
5    1

How can I do that?

Background: I have a sklearn learner which maps categorical target internally to learner.classes_ numpy array with order of original classes. I am trying to implement some additional methods and I would need to map their input (the input Series above) using those classes_, each class to its index, because this is what is then internally used in the learner.

Mitar

5 Answers

2

You can use Categorical Data to specify a custom ordering via a list. Conversion to codes is possible via pd.Series.cat.codes:

import pandas as pd

df = pd.DataFrame({'s': ['bar', 'foo', 'bar', 'bar', 'bar', 'foo']})

# Desired category order; each category's position becomes its code
orderList = ['bar', 'foo']

df['s'] = pd.Categorical(df['s'], categories=orderList, ordered=True)
df['s'] = df['s'].cat.codes

print(df)

   s
0  0
1  1
2  0
3  0
4  0
5  1
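
For the use case in the question, the category list can be a numpy array rather than a Python list. The following is a minimal sketch under that assumption: `classes` is a stand-in for `learner.classes_`, and the `'baz'` value is added only to illustrate what happens to a label that is not among the categories.

```python
import numpy as np
import pandas as pd

# Stand-in for learner.classes_ (an assumption for illustration)
classes = np.array(['bar', 'foo'])

# 'baz' is not a known class; it is included to show how unknowns are coded
s = pd.Series(['bar', 'foo', 'bar', 'baz'])

# Categorical codes are the positions of each value in `classes`;
# values not present in the categories get the code -1
codes = pd.Series(pd.Categorical(s, categories=classes).codes, index=s.index)

print(codes.tolist())
```

Checking for `-1` in the result is one way to detect inputs the learner has never seen before passing the codes on.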
jpp
1

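OK, it seems this does it:

import pandas

# classes is the numpy array of class labels, e.g. learner.classes_
mapping_series = pandas.Series(range(len(classes)), index=classes)
output = input.map(mapping_series)

So the trick is that the strings must be the index of the mapping Series, because Series.map looks each value up in that index. I first tried output = input.map(pandas.Series(classes)), but this does not work: there the strings are the values and the index is just 0, 1, ..., so nothing matches and every result is NaN.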
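
The difference between the two mapping Series can be checked side by side. This is a small sketch, not part of the original answer: the names `values`, `wrong` and `right` are made up for illustration, and `classes` again stands in for `learner.classes_`.

```python
import numpy as np
import pandas as pd

# Stand-in for learner.classes_ (an assumption for illustration)
classes = np.array(['bar', 'foo'])
values = pd.Series(['bar', 'foo', 'bar', 'bar', 'bar', 'foo'])

# Index is 0, 1 and the strings are the values: no label is found in the index
wrong = values.map(pd.Series(classes))

# Index is the strings and the values are the positions: each label is found
right = values.map(pd.Series(range(len(classes)), index=classes))

print(wrong.isna().all())
print(right.tolist())
```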

Mitar
  • You should look at the Categorical Data solution I posted. This is the natural solution available to Pandas. – jpp Jul 06 '18 at 08:37
0

You can convert categorical values to numerical ones using the replace function:

import numpy as np
import pandas as pd

s = pd.Series(['aa', 'bb', 'aa'])
ref = np.array(['aa', 'bb'])
# Map each label to its position in ref
d = {str(r): i for i, r in enumerate(ref)}
s = s.replace(d)
Sreekiran A R
  • `replace` is [inefficient](https://stackoverflow.com/questions/49259580/replace-values-in-a-pandas-series-via-dictionary-efficiently) vs `map`. – jpp Jul 06 '18 at 08:38
0

As sklearn trees depend on the way you integer-encode the categories, you might want to encode them with a custom mapping:

import pandas as pd

df = pd.DataFrame({'the_column': ['bar', 'foo', 'bar', 'bar', 'bar', 'foo']})
cat_map = {'bar': 0, 'foo': 1}
df['category_map'] = df['the_column'].map(cat_map)
# drop returns a new DataFrame, so assign the result back
df = df.drop('the_column', axis=1)
df.head()
Fenil
0

Internally, scikit-learn estimators typically use LabelEncoder to encode string class labels as integers. By default, LabelEncoder uses numpy.unique to collect the unique classes, and numpy.unique returns them sorted, which for strings means alphabetical order.

You too can use that (or extend that) to fulfil your requirements.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['bar', 'foo', 'bar', 'bar', 'bar', 'foo'])

le.classes_
# Output (Python 3): array(['bar', 'foo'], dtype='<U3')

le.transform(['bar', 'foo', 'bar'])
# Output: array([0, 1, 0])

le.inverse_transform([0, 1, 1])
# Output (Python 3): array(['bar', 'foo', 'foo'], dtype='<U3')
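
The sorted-order behaviour described above can also be reproduced with numpy alone. The sketch below is an illustration rather than scikit-learn's actual code path: it uses `numpy.unique` with `return_inverse=True` to get both the sorted classes and the codes, and then `numpy.searchsorted` to encode new labels against an already sorted classes array. It assumes every new label is present in `classes`; `searchsorted` does not check that.

```python
import numpy as np

labels = np.array(['bar', 'foo', 'bar', 'bar', 'bar', 'foo'])

# classes comes back sorted; codes[i] is the position of labels[i] in classes
classes, codes = np.unique(labels, return_inverse=True)

# Encode new labels against the sorted classes array.
# Assumption: each new label already exists in classes.
new_codes = np.searchsorted(classes, np.array(['foo', 'bar']))

print(classes.tolist())
print(codes.tolist())
print(new_codes.tolist())
```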
Vivek Kumar