Map a Pandas Series of strings using index position in another array

Question

I have a Pandas Series like:

0    bar
1    foo
2    bar
3    bar
4    bar
5    foo

I would like to map this Series to another Series based on a NumPy array that specifies the order, [bar, foo], so that each string is replaced by its position in that array. The result should be:

0    0
1    1
2    0
3    0
4    0
5    1

How can I do that?

Background: I have an sklearn learner that internally maps a categorical target to its learner.classes_ NumPy array, which holds the original classes in a fixed order. I am implementing some additional methods and need to map their input (the Series above) through classes_, each class to its index, because that index is what the learner uses internally.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html — Raghav Patnecha, Jul 06 '18 at 07:45
Oh, I see, I have to make a mapping series with index as strings. — Mitar, Jul 06 '18 at 07:50

score 2 · Answer 1 · answered Jul 06 '18 at 08:22

You can use Categorical Data to specify a custom ordering via a list. Conversion to codes is possible via pd.Series.cat.codes:

import pandas as pd

df = pd.DataFrame({'s': ['bar', 'foo', 'bar', 'bar', 'bar', 'foo']})

orderList = ['bar', 'foo']

df['s'] = pd.Categorical(df['s'], categories=orderList, ordered=True)
df['s'] = df['s'].cat.codes

print(df)

   s
0  0
1  1
2  0
3  0
4  0
5  1
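When the order comes from a fitted learner rather than a hand-written list, the same idea should work with the array itself as the categories. The following is a minimal sketch, assuming a hypothetical `classes` array that stands in for learner.classes_ (no real learner is involved) and keeping the result as a Series with the original index:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for learner.classes_ (an assumption for illustration).
classes = np.array(['bar', 'foo'])

s = pd.Series(['bar', 'foo', 'bar', 'bar', 'bar', 'foo'])

# Build a categorical dtype whose category order is the order of `classes`,
# then read off each value's position in that order.
dtype = pd.CategoricalDtype(categories=classes)
codes = s.astype(dtype).cat.codes

print(codes.tolist())
```

The codes are positions in `classes`, so changing the order of that array changes the integers assigned.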

score 1 · Answer 2 · answered Jul 06 '18 at 07:52

OK, it seems this does it:

import pandas

# classes is the learner's classes_ array; input is the Series from the question
mapping_series = pandas.Series(range(len(classes)), index=classes)
output = input.map(mapping_series)

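So the trick is that the strings must be the index of the mapping series, with the integer positions as its values. I first tried output = input.map(pandas.Series(classes)), but that does not work: that series has the positions as its index and the strings as its values, so no string is ever found in the index and every lookup yields NaN.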
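To make the difference concrete, here is a small self-contained sketch of both attempts. The `classes` array is a hypothetical stand-in for the learner's classes_, and the Series is named `s` rather than `input` to avoid shadowing the Python builtin:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for learner.classes_ (an assumption for illustration).
classes = np.array(['bar', 'foo'])
s = pd.Series(['bar', 'foo', 'bar', 'bar', 'bar', 'foo'])

# Working version: class labels are the index, positions are the values.
mapping_series = pd.Series(range(len(classes)), index=classes)
output = s.map(mapping_series)

# Non-working version: positions are the index, labels are the values,
# so none of the strings in `s` is found and every result is missing.
wrong = s.map(pd.Series(classes))

print(output.tolist())
print(wrong.isna().all())
```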

Mitar

You should look at the Categorical Data solution I posted. This is the natural solution available to Pandas. – jpp Jul 06 '18 at 08:37

Sreekiran A R · Answer 3 · 2018-07-06T08:16:00.623

You can convert categorical values to numerical ones using the replace function:

import numpy as np
import pandas as pd

s = pd.Series(['aa', 'bb', 'aa'])
ref = np.array(['aa', 'bb'])
# Map each label to its position in ref
d = {str(r): i for i, r in enumerate(ref)}
s = s.replace(d)

edited Jul 06 '18 at 08:16

answered Jul 06 '18 at 07:59

`replace` is [inefficient](https://stackoverflow.com/questions/49259580/replace-values-in-a-pandas-series-via-dictionary-efficiently) vs `map`. – jpp Jul 06 '18 at 08:38
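For comparison, the same dictionary can be passed to `map`. This is only a sketch of the alternative the comment points to, with an extra label `'cc'` added as an assumption to show one behavioural difference: `map` turns labels missing from the dictionary into NaN, whereas `replace` leaves them unchanged.

```python
import numpy as np
import pandas as pd

ref = np.array(['aa', 'bb'])
# 'cc' is deliberately not in ref (added here for illustration only).
s = pd.Series(['aa', 'bb', 'aa', 'cc'])

# Map each label to its position in ref.
d = {str(r): i for i, r in enumerate(ref)}

# Labels without a dictionary entry come back as NaN.
mapped = s.map(d)
print(mapped)
```

If unknown labels must be caught early, checking `mapped.isna().any()` after the call is one simple option.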

score 0 · Answer 4 · answered Jul 06 '18 at 08:35

Because sklearn trees depend on how you integer-encode the categories, you might want to encode them with a custom mapping:

import pandas as pd

df = pd.DataFrame({'the_column': ['bar', 'foo', 'bar', 'bar', 'bar', 'foo']})
cat_map = {'bar': 0, 'foo': 1}
df['category_map'] = df['the_column'].map(cat_map)
# drop returns a new DataFrame, so assign the result back
df = df.drop('the_column', axis=1)
df.head()

score 0 · Answer 5 · answered Jul 06 '18 at 10:10

Internally, scikit-learn classifiers generally encode string class labels as integers, typically with LabelEncoder. By default LabelEncoder uses numpy.unique to collect the unique classes, and numpy.unique returns them sorted, which for strings means alphabetical order.

You can use LabelEncoder yourself (or extend it) to meet your requirements.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['bar', 'foo', 'bar', 'bar', 'bar', 'foo'])

le.classes_
#Output: array(['bar', 'foo'], dtype='|S3')

le.transform(['bar', 'foo', 'bar']) 
#Output: array([0, 1, 0])

le.inverse_transform([0, 1, 1])
#Output: array(['bar', 'foo', 'foo'], dtype='|S3')
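The sorted order mentioned above can be checked with numpy.unique directly. This sketch illustrates numpy.unique only, on made-up labels; it is not a claim about what any particular estimator does internally:

```python
import numpy as np

# Made-up labels, deliberately not in sorted order.
y = np.array(['foo', 'bar', 'foo', 'baz'])

# classes holds the sorted unique labels; encoded holds, for each element
# of y, its position in classes.
classes, encoded = np.unique(y, return_inverse=True)

print(classes.tolist())
print(encoded.tolist())
```

Because the order is sorted rather than order of first appearance, 'foo' receives the largest code here even though it appears first in `y`.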
