I am trying to create a one-hot encoding (ohe) of a list of characters that allows for unobserved levels. Using answers from "Convert array of indices to 1-hot encoded numpy array" and "Finding the index of an item given a list containing it in Python", the following does what I want:

import numpy as np

# example data
# this is the full list including unobserved levels
av = list(map(chr, range(ord('a'), ord('z') + 1)))
# this is the vector to apply ohe
v = ['a', 'f', 'u']

# apply one hot encoding
ohe = np.zeros((len(v), len(av)))
for i in range(len(v)):
    ohe[i, av.index(v[i])] = 1
ohe

Is there a more standard or faster way to do this? Note that the second link above mentions that .index() is a bottleneck.

(Scale of my problem: the full vector (av) has ~1000 levels, and the vector of values to encode (v) has length 0.5M.) Thanks.

user2957945

1 Answer

You could use a lookup dictionary:

import numpy as np

# example data
# this is the full list including unobserved levels
av = list(map(chr, range(ord('a'), ord('z') + 1)))
# map each level to its column index (built once, O(len(av)))
lookup = {level: i for i, level in enumerate(av)}

# this is the vector to apply ohe
v = ['a', 'f', 'u']

# apply one hot encoding
ohe = np.zeros((len(v), len(av)))
for i in range(len(v)):
    ohe[i, lookup[v[i]]] = 1

list.index is O(n) in the length of the list, whereas a dictionary lookup is O(1) on average. You can even avoid the for loop by doing:

indices = [lookup[vi] for vi in v]
ohe = np.zeros((len(v), len(av)))
ohe[np.arange(len(v)), indices] = 1
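
At the scale mentioned in the question (about 500,000 values by 1,000 levels), the default float64 array from np.zeros would need roughly 4 GB, so a smaller dtype may be worth considering. The following is only a sketch of the same dictionary-plus-fancy-indexing idea wrapped in a function; the name one_hot and the uint8 default are assumptions for illustration, not part of the original answer.

```python
import numpy as np

def one_hot(values, levels, dtype=np.uint8):
    """One-hot encode `values` against the full list `levels` (hypothetical helper)."""
    # Build the level -> column lookup once.
    lookup = {level: j for j, level in enumerate(levels)}
    # Translate each value to its column index; raises KeyError if a value is not in `levels`.
    cols = np.fromiter((lookup[x] for x in values), dtype=np.intp, count=len(values))
    # uint8 uses 1 byte per cell instead of 8 for float64 (assumed acceptable here).
    out = np.zeros((len(values), len(levels)), dtype=dtype)
    # Set one cell per row in a single vectorized assignment.
    out[np.arange(len(values)), cols] = 1
    return out

av = list(map(chr, range(ord('a'), ord('z') + 1)))
v = ['a', 'f', 'u']
ohe = one_hot(v, av)
```

Values that do not appear in the full list raise a KeyError here, which seems preferable to silently encoding them; whether that is the desired behaviour depends on the data.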
Dani Mesejo