One hot encoding a list of characters with unobserved levels

Question

I am trying to create a one-hot encoding (OHE) of a list of characters, allowing for unobserved levels. Using answers from "Convert array of indices to 1-hot encoded numpy array" and "Finding the index of an item given a list containing it in Python", the following does what I want:

import numpy as np

# example data
# this is the full list including unobserved levels
av = list(map(chr, range(ord('a'), ord('z') + 1)))
# this is the vector to apply ohe
v = ['a', 'f', 'u']

# apply one hot encoding: one row per value, one column per level
ohe = np.zeros((len(v), len(av)))
for i in range(len(v)):
    ohe[i, av.index(v[i])] = 1
ohe

Is there a more standard or faster way to do this? Note that the second link above mentions that .index() is a bottleneck.

(Scale of my problem: the full vector (av) has ~1000 levels, and the vector of values to encode (v) has length 0.5M.) Thanks.

Dani Mesejo · Accepted Answer · 2019-10-14T15:25:38.650


You could use a lookup dictionary:

import numpy as np

# example data
# this is the full list including unobserved levels
av = list(map(chr, range(ord('a'), ord('z') + 1)))
# map each level to its column index
lookup = {level: i for i, level in enumerate(av)}

# this is the vector to apply ohe
v = ['a', 'f', 'u']

# apply one hot encoding
ohe = np.zeros((len(v), len(av)))
for i in range(len(v)):
    ohe[i, lookup[v[i]]] = 1

Each call to a list's .index() is O(n) in the length of the list, whereas a dictionary lookup is O(1) on average. You can even avoid the explicit for loop by doing:

indices = [lookup[vi] for vi in v]
ohe = np.zeros((len(v), len(av)))
ohe[np.arange(len(v)), indices] = 1
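
At the scale mentioned in the question (about 0.5M values and 1000 levels), the dtype of the output matters: np.zeros defaults to float64, so the dense array would take roughly 4 GB, while uint8 needs about 0.5 GB. The following is only a sketch that wraps the approach above in a function; the name one_hot, its dtype parameter, and the choice to raise ValueError for values missing from the level list are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def one_hot(values, levels, dtype=np.uint8):
    """One-hot encode `values` against the full list `levels`.

    Hypothetical helper: the name, the uint8 default and the error
    handling are assumptions made for this sketch.
    """
    # Build the level -> column index mapping once (O(1) average lookups).
    lookup = {level: i for i, level in enumerate(levels)}
    try:
        indices = [lookup[x] for x in values]
    except KeyError as err:
        # A value that is not in `levels` cannot be encoded.
        raise ValueError(f"value {err.args[0]!r} is not in levels") from None
    # One row per value, one column per level, all zeros to start with.
    ohe = np.zeros((len(values), len(levels)), dtype=dtype)
    # Integer array indexing sets exactly one entry per row.
    ohe[np.arange(len(values)), indices] = 1
    return ohe

av = list(map(chr, range(ord('a'), ord('z') + 1)))
v = ['a', 'f', 'u']
ohe = one_hot(v, av)
```

Each row of the result contains a single 1 in the column of the corresponding level, and columns for unobserved levels stay all zero. If even uint8 is too large, a sparse matrix may be worth considering, though that is outside what the answer covers.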

edited Oct 14 '19 at 15:25

answered Oct 14 '19 at 15:20


Glad I could help :) – Dani Mesejo Oct 14 '19 at 15:29
