Edit: a simple mapping to numbers may be faster and has no collisions:
from numpy import array
features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)
numbers = range(len(features))
# build lookups in both directions
num2string = dict(zip(numbers, features))
string2num = dict(zip(features, numbers))
# read the result
for i in num2string:
    print("%i => '%s'" % (i, num2string[i]))
print("usage test:")
print(string2num['oklahoma'])
print(num2string[string2num['oklahoma']])
You will get a simple sequence of numbers for every item in your array:
0 => 'oklahoma'
1 => 'florida'
2 => 'idaho'
Advantage: simplicity and speed
Disadvantage: You'll get a different number for the same string if you change its position in the array, unlike with hashing the strings.
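One possible way to soften this disadvantage is to number the strings by their sorted order instead of their position. The sketch below is only an illustration; `build_mapping` is a hypothetical helper, not part of the code above.

```python
def build_mapping(items):
    """Map each distinct string to its index in sorted order (hypothetical helper)."""
    ordered = sorted(set(items))
    return {name: idx for idx, name in enumerate(ordered)}

# The same strings in a different order give the same numbers.
a = build_mapping(['oklahoma', 'florida', 'idaho'])
b = build_mapping(['idaho', 'oklahoma', 'florida'])
print(a == b)
print(a['oklahoma'])
```

Note that the numbers can still shift when a new string is added that sorts before existing ones, so this only removes the dependence on input order.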
Usage of hashing
You can hash the strings using a well-chosen hash algorithm. Be careful about the number of collisions your hash function produces: if two different strings have the same hash, the same number will appear twice in your input. This example uses the md5 hash function:
import hashlib
from numpy import array
def string_to_num(s):
    # md5 works on bytes, so encode the string first
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16)
features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)
# hash those strings
features_string_for_number = {}
for i in features:
    hash_number = string_to_num(i)
    features_string_for_number[hash_number] = i
# read the result
for i in features_string_for_number:
    print("%i => '%s'" % (i, features_string_for_number[i]))
print("usage test:")
print(string_to_num('oklahoma'))
print(features_string_for_number[string_to_num('oklahoma')])
The hashing part is taken from here.
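Since collisions are the main risk with hashing, it may be worth checking for them while the lookup table is built. The following is a minimal sketch under that assumption; `build_hash_table` and its `hash_func` parameter are hypothetical names introduced here for illustration.

```python
import hashlib

def string_to_num(s):
    """Return the md5 digest of s as an integer."""
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16)

def build_hash_table(strings, hash_func=string_to_num):
    """Map hash -> string; raise if two different strings share a hash."""
    table = {}
    for s in strings:
        h = hash_func(s)
        if h in table and table[h] != s:
            raise ValueError("collision: %r and %r both hash to %d" % (table[h], s, h))
        table[h] = s
    return table

table = build_hash_table(['oklahoma', 'florida', 'idaho'])
print(table[string_to_num('oklahoma')])
```

Passing a deliberately weak function such as `len` as `hash_func` is an easy way to see the check fire, for example with `['idaho', 'texas']`.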