9

I'm new to data analytics and I'm trying out some models in Python scikit-learn. I have a dataset in which some of the columns contain text values, like below:

[screenshot: Dataset]

Is there a way to convert these column values into numbers in pandas or scikit-learn? Would assigning numbers to these values be the right approach? And what happens if a new string shows up in the test data?

Please advise.

Sumithran
Selva Saravana Er
  • Consider using the [get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function available in pandas. Ignore any new values encountered in the test data; you cannot use values that were not seen during training (a short sketch of this approach follows these comments). – shanmuga Jan 21 '16 at 05:15
  • I was thinking of using it, but some of the columns have many unique values (up to 400+). – Selva Saravana Er Jan 21 '16 at 05:23
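
A minimal sketch of the get_dummies approach suggested above (the column name city and its values are hypothetical): one-hot encode the training data, then reindex the test dummies to the training columns so an unseen value simply becomes an all-zero row.

import pandas as pd

# Hypothetical data: 'Tokyo' never appears in training.
train = pd.DataFrame({'city': ['NY', 'LA', 'SF']})
test = pd.DataFrame({'city': ['LA', 'Tokyo']})

train_dummies = pd.get_dummies(train, columns=['city'])
test_dummies = pd.get_dummies(test, columns=['city'])

# Align the test columns to the training columns; dummy columns unseen in
# training are dropped, and missing ones are filled with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)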

3 Answers

3

Consider using label encoding - it transforms categorical data by assigning each category an integer between 0 and num_of_categories - 1:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame(['a','b','c','d','a','c','a','d'], columns=['letter'])

  letter
0      a
1      b
2      c
3      d
4      a
5      c
6      a
7      d

Applying:

le = LabelEncoder()
# fit_transform is applied column by column, so each column gets its own integer codes
encoded_series = df.apply(le.fit_transform)

encoded_series:

    letter
0   0
1   1
2   2
3   3
4   0
5   2
6   0
7   3
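
To reuse the same mapping on new data at prediction time, keep the fitted encoder: fit it once on the training column, then call transform on new values and inverse_transform to go back to the original labels. A minimal sketch continuing the example above; note that transform raises an error for labels never seen during fitting.

le = LabelEncoder()
le.fit(df['letter'])                         # learn the mapping from the training column

train_encoded = le.transform(df['letter'])   # array([0, 1, 2, 3, 0, 2, 0, 3])
new_encoded = le.transform(['c', 'd'])       # array([2, 3])
le.inverse_transform(new_encoded)            # array(['c', 'd'], dtype=object)
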
Amir F
  • How would you apply this to prediction data to get the matching letter number? e.g. when I want to predict `d` it has to be converted to `3` from your example. – STIKO Sep 26 '18 at 02:43
  • If I am understanding you correctly - you can keep a copy of the original values on the side for reference. You will be able to convert back to letters if needed. I hope this is helpful - in case it's not, please clarify what you are trying to do. – Amir F Sep 27 '18 at 06:31
  • So, let's use your example as my dataset for simplicity and let's pretend there is a target column (we don't care about it for this example), before I train my model on it, I convert it to numbers, then, I train my model on it. Now I have a trained model. Now I want to feed my model with a feature `c` to get a prediction. From your example `c` was converted to `2` (easy since I can look at it), so I need to feed my model with `2` to get my prediction. The question is how do I get `2` for `c`? – STIKO Sep 27 '18 at 22:24
  • You can toggle back and forth (2 to c and back) with np.where. It's as simple as 'IF' in Excel. (https://medium.com/@emayoung95/using-numpy-where-function-to-replace-for-loops-with-if-else-statements-a1e6044ac4c1) – Amir F Sep 30 '18 at 08:05
  • This may be helpful as well - https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn – Amir F Oct 01 '18 at 09:33
  • That's exactly what I was looking for. Thanks a bunch @Amir – STIKO Oct 02 '18 at 14:22
0

You can convert them into integer codes by using the categorical datatype.

# convert the column to pandas' categorical dtype, then take the integer code of each category
column = column.astype('category')
column_encoded = column.cat.codes

As long as you use a tree-based model with deep enough trees, e.g. GradientBoostingClassifier(max_depth=10), your model should be able to split out the categories again.
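
One caveat worth noting: cat.codes are assigned per Series, so train and test columns can end up with different mappings unless the category list is fixed. A minimal sketch, assuming a hypothetical city column:

import pandas as pd

train = pd.DataFrame({'city': ['NY', 'LA', 'SF']})
test = pd.DataFrame({'city': ['LA', 'Tokyo']})

# Fix the categories from the training data so both frames share one mapping;
# values outside that list (e.g. 'Tokyo') get the code -1.
city_dtype = pd.api.types.CategoricalDtype(categories=train['city'].unique())
train['city_code'] = train['city'].astype(city_dtype).cat.codes
test['city_code'] = test['city'].astype(city_dtype).cat.codes   # 'Tokyo' -> -1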

maxymoo
0

I think it would be better to use OrdinalEncoder if you want to transform feature columns, because it's meant for categorical features (LabelEncoder is meant for labels). Also, it can handle values not seen in training and multiple features at the same time. An example:

from sklearn.preprocessing import OrdinalEncoder

features = ["city", "age", ...]
encoder = OrdinalEncoder(
        handle_unknown='use_encoded_value', 
        unknown_value=-1
    ).fit(train[features])
train[features] = encoder.transform(train[features])
test[features] = encoder.transform(test[features])

More details in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
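
For example (hypothetical data; handle_unknown='use_encoded_value' requires scikit-learn 0.24 or later), a city that only appears in the test set is encoded as -1 instead of raising an error:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'city': ['NY', 'LA', 'SF'], 'age': ['young', 'old', 'old']})
test = pd.DataFrame({'city': ['LA', 'Tokyo'], 'age': ['old', 'young']})

features = ['city', 'age']
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train[features])

print(encoder.transform(test[features]))
# [[ 0.  0.]
#  [-1.  1.]]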

scepeda