Looking for a way to preprocess string features

Question

For a machine learning problem, every sample has a location feature (a state in America). The whole feature vector looks like this:

array(['oklahoma', 'florida', 'idaho', ..., 'pennsylvania', 'alabama',
   'washington'], dtype=object)

I cannot feed this directly into a sklearn algorithm, so I have to convert it into numerical features somehow, but I don't know how. What are the best ways to convert these string features? Would ASCII conversion work?

edit: I want every state to have its own unique numerical value.

Do you want to put geographically near cities into groups? What do you want to achieve? — nio, Nov 29 '13 at 16:51
Actually for starters I just want every state to have its own unique numerical value. But I would be interested in a geographic technique. — Learner, Nov 29 '13 at 16:53

score 6 · Answer 1 · answered Nov 29 '13 at 17:05

You can use LabelEncoder from sklearn's label preprocessing:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
     'washington'])
le.classes_
# array(['alabama', 'florida', 'idaho', 'oklahoma', 'pennsylvania',
#         'washington'],
#       dtype='|S12')
le.transform(["oklahoma"])
# array([3])

neil · Accepted Answer · 2013-11-29T17:14:42.637

If you just want to turn each state name into a unique numerical value then hash(text) would work well.

A different hash function may be needed, though, because the built-in hash() is not guaranteed to return the same value every time Python is run. In fact, from Python 3.3 onwards string hashes are salted differently for each run unless you specifically set it up to do otherwise (for example with the PYTHONHASHSEED environment variable). The hashlib module contains various hash algorithms that may suit better.
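To illustrate the point about salting, here is a minimal sketch of a hash that stays the same across Python runs. It uses hashlib.sha256 and keeps only the first 8 bytes of the digest. The function name stable_hash and the n_buckets parameter are made up for this example and are not part of any library.

```python
import hashlib


def stable_hash(text, n_buckets=2**32):
    """Map a string to an integer in [0, n_buckets), identically on every run."""
    # hashlib digests do not depend on PYTHONHASHSEED, unlike the built-in hash().
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then fold into the range.
    return int.from_bytes(digest[:8], "big") % n_buckets


states = ["oklahoma", "florida", "idaho"]
codes = [stable_hash(s) for s in states]
print(codes)
```

Folding the digest into a smaller range makes collisions more likely, so distinct states are not guaranteed to get distinct codes.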

score 3 · Answer 3 · edited May 23 '17 at 12:19

Edit: a simple mapping to numbers may be faster and avoids collisions:

from numpy import array

features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)

# number the strings by their position in the array
numbers = range(len(features))
num2string = dict(zip(numbers, features))
string2num = dict(zip(features, numbers))

# read the result
for i in num2string:
    print("%i => '%s'" % (i, num2string[i]))

print("usage test:")
print(string2num['oklahoma'])
print(num2string[string2num['oklahoma']])

You will get a simple sequence of numbers for every item in your array:

0 => 'oklahoma'
1 => 'florida'
2 => 'idaho'

Advantage: simplicity and speed.
Disadvantage: you'll get different numbers for the same string if you change its position in the array, unlike with hashing the strings.
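One possible way around the ordering disadvantage is to number the distinct strings in sorted order, which mirrors the sorted le.classes_ shown in the LabelEncoder answer. This is only a sketch; the helper name build_mapping is made up for illustration.

```python
features = ["oklahoma", "florida", "idaho", "pennsylvania", "alabama", "washington"]


def build_mapping(values):
    """Assign 0..k-1 to the distinct values, in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(values)))}


string2num = build_mapping(features)
num2string = {index: name for name, index in string2num.items()}

# Reordering the input does not change which number each string gets.
assert build_mapping(reversed(features)) == string2num

encoded = [string2num[name] for name in features]
print(encoded)  # [3, 1, 2, 4, 0, 5]
```

The numbers still shift if a new state name is added later, so the mapping should be built once and reused for all data.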

Usage of hashing

You can hash the string using a well-chosen hash algorithm. You have to be careful about the number of collisions for your hash function: if two different strings have the same hash, they end up with the same number in your input. In this example, the md5 hash function is used:

import hashlib
from numpy import array


def string_to_num(s):
    # md5 needs bytes in Python 3, so encode the string first
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16)

features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)

# hash those strings
features_string_for_number = {}
for i in features:
    hash_number = string_to_num(i)
    features_string_for_number[hash_number] = i

# read the result
for i in features_string_for_number:
    print("%i => '%s'" % (i, features_string_for_number[i]))

print("usage test:")
print(string_to_num('oklahoma'))
print(features_string_for_number[string_to_num('oklahoma')])

The hashing part is taken from here.
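As a follow-up to the collision warning, here is a minimal sketch of how a collision could be detected while building the lookup table instead of silently overwriting an entry. The helper name build_hash_table and its hash_func parameter are hypothetical, and Python 3 is assumed (hashlib.md5 needs bytes there).

```python
import hashlib


def string_to_num(s):
    # md5 operates on bytes, so encode the text first (Python 3).
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)


def build_hash_table(strings, hash_func=string_to_num):
    """Return {hash: string}, raising ValueError if two different strings collide."""
    table = {}
    for s in strings:
        h = hash_func(s)
        if h in table and table[h] != s:
            raise ValueError("collision: %r and %r both hash to %d" % (table[h], s, h))
        table[h] = s
    return table


states = ["oklahoma", "florida", "idaho", "pennsylvania", "alabama", "washington"]
table = build_hash_table(states)
print(len(table))
```

Passing a deliberately weak function such as len as hash_func triggers the error, because 'florida' and 'alabama' have the same length.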
