I am trying to create a one-hot encoding (ohe) of a list of characters, allowing for unobserved levels. Using answers from Convert array of indices to 1-hot encoded numpy array and Finding the index of an item given a list containing it in Python, the following does what I want:
import numpy as np
# example data
# this is the full list including unobserved levels
av = list(map(chr, range(ord('a'), ord('z')+1)))
# this is the vector to apply ohe
v = ['a', 'f', 'u']
# apply one hot encoding: one row per element of v, one column per level in av
ohe = np.zeros((len(v), len(av)))
for i in range(len(v)):
    ohe[i, av.index(v[i])] = 1
ohe
Is there a more standard/faster way to do this? Note that the second link above mentions that .index() is a bottleneck.
(Scale of my problem: the full vector (av) has ~1000 levels, and the vector of values to encode (v) has length 0.5M.) Thanks.
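For reference, one direction that seems to avoid the repeated .index() scan is to build a level-to-column dictionary once and then fill the matrix with a single fancy-indexing assignment. This is only a minimal sketch, assuming every value in v is present in av (otherwise the lookup raises a KeyError); the names col and cols are mine, not from the linked answers.

```python
import numpy as np

# full list of levels, including unobserved ones
av = list(map(chr, range(ord('a'), ord('z') + 1)))
# values to encode
v = ['a', 'f', 'u']

# build the level -> column lookup once, instead of scanning av for every element of v
col = {level: j for j, level in enumerate(av)}

# column index for each element of v (assumes every value in v is in av)
cols = np.fromiter((col[x] for x in v), dtype=np.intp, count=len(v))

# set exactly one entry per row using integer-array indexing
ohe = np.zeros((len(v), len(av)))
ohe[np.arange(len(v)), cols] = 1
```

At the stated scale, a dense float64 matrix of 0.5M rows by 1000 columns would need about 4 GB, so a smaller dtype (for example passing dtype=np.uint8 to np.zeros) may be worth considering.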