
I have a numpy array of shape (192, 224, 192, 1). The last dimension is the integer class that I would like to one hot encode. For example, if I have 12 classes I would like the shape of the resulting array to be (192, 224, 192, 12), with the last dimension being all zeros except for a 1 at the index corresponding to the original value.

I can do this naively with many for loops, but would like to know if there is a better way to do it.

Mad Physicist
PDPDPDPD

3 Answers


You can create a new zeros array and populate it with advanced indexing.

import numpy as np

# sample array with 12 classes
np.random.seed(123)
a = np.random.randint(0, 12, (192, 224, 192, 1))

b = np.zeros((a.size, a.max() + 1))

# use advanced indexing to get one-hot encoding
b[np.arange(a.size), a.ravel()] = 1

# reshape to original form
b = b.reshape(a.shape[:-1] + (a.max() + 1,))

print(b.shape)
print(a[0, 0, 0])
print(b[0, 0, 0])

Output

(192, 224, 192, 12)
[2]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Similar to this answer but with array reshaping.

RichieV
  • The total index arrays are shorter if you don't reshape – Mad Physicist Sep 11 '20 at 03:16
  • @RichieV. I've posted an answer and generalized it to arbitrary dimensions – Mad Physicist Sep 11 '20 at 03:55
  • Let me know if you get a chance to play with it. I posted from mobile, so there's no guarantee that the code is error free – Mad Physicist Sep 11 '20 at 04:02
  • This answer worked well for my problem. The only change I had to make is to change `a.max() + 1` to the number of classes I have. This particular ML problem is segmentation so this whole array is my label but not every class is represented in every label so it must be hardcoded. – PDPDPDPD Sep 11 '20 at 17:38
  • @PDPDPDPD consider upvoting Mad's answer, it actually performs better and includes a generalized function. Glad you fixed your code! – RichieV Sep 11 '20 at 18:38
  • Thanks for the advice. It ended up speeding up a substantial amount! – PDPDPDPD Sep 11 '20 at 20:30
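Following up on the comment thread above: if not every class appears in a given label, a.max() + 1 underestimates the class count, so the one-hot width should be passed in explicitly. A minimal sketch of that variant (the function and parameter names here are my own, not from the answer):

```python
import numpy as np

def onehot_fixed(a, num_classes):
    """One-hot encode integer labels with a fixed class count.

    Useful for segmentation labels where not every class is
    present in every volume.
    """
    b = np.zeros((a.size, num_classes), dtype=np.uint8)
    b[np.arange(a.size), a.ravel()] = 1
    return b.reshape(a.shape[:-1] + (num_classes,))

labels = np.array([[[0], [2]]])       # shape (1, 2, 1); class 1 is absent
print(onehot_fixed(labels, 4).shape)  # (1, 2, 4), not (1, 2, 3)
```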

You can do this in a single indexing operation if you know the max. Given an array a and m = a.max() + 1:

out = np.zeros(a.shape[:-1] + (m,), dtype=bool)
out[(*np.indices(a.shape[:-1], sparse=True), a[..., 0])] = True

It's easier if you remove the unnecessary trailing dimension:

a = np.squeeze(a)
out = np.zeros(a.shape + (m,), bool)
out[(*np.indices(a.shape, sparse=True), a)] = True

The explicit tuple in the index is necessary to do star expansion.
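The same pattern can be sanity-checked on a small array (the sizes here are illustrative, not from the question):

```python
import numpy as np

a = np.array([[2, 0], [1, 2]])   # small 2-D label array, 3 classes
m = a.max() + 1
out = np.zeros(a.shape + (m,), bool)
# np.indices(..., sparse=True) returns a tuple of open grids that
# broadcast against `a`, setting out[i, j, a[i, j]] = True
out[(*np.indices(a.shape, sparse=True), a)] = True
print(out.astype(int))
```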

If you want to extend this to an arbitrary dimension, you can do that too. The following will insert a new dimension into the squeezed array at axis. Here axis is the position of the new axis in the final array, which is consistent with, say, np.stack, but not with list.insert:

def onehot(a, axis=-1, dtype=bool):
    pos = axis if axis >= 0 else a.ndim + axis + 1
    shape = list(a.shape)
    shape.insert(pos, a.max() + 1)
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind.insert(pos, a)
    out[tuple(ind)] = True
    return out

If you have a singleton dimension to expand, the generalized solution can find the first available singleton dimension:

def onehot2(a, axis=None, dtype=bool):
    shape = np.array(a.shape)
    if axis is None:
        axis = (shape == 1).argmax()
    if shape[axis] != 1:
        raise ValueError(f'Dimension at {axis} is non-singleton')
    shape[axis] = a.max() + 1
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind[axis] = a
    out[tuple(ind)] = True
    return out

To use the last available singleton, replace axis = (shape == 1).argmax() with

axis = a.ndim - 1 - (shape[::-1] == 1).argmax()

Here are some example usages:

>>> np.random.seed(0x111)
>>> x = np.random.randint(5, size=(3, 2))
>>> x
array([[2, 3],
       [3, 1],
       [4, 0]])

>>> a = onehot(x, axis=-1, dtype=int)
>>> a.shape
(3, 2, 5)
>>> a
array([[[0, 0, 1, 0, 0],    # 2
        [0, 0, 0, 1, 0]],   # 3

       [[0, 0, 0, 1, 0],    # 3
        [0, 1, 0, 0, 0]],   # 1

       [[0, 0, 0, 0, 1],    # 4
        [1, 0, 0, 0, 0]]])  # 0

>>> b = onehot(x, axis=-2, dtype=int)
>>> b.shape
(3, 5, 2)
>>> b
array([[[0, 0],
        [0, 0],
        [1, 0],
        [0, 1],
        [0, 0]],

       [[0, 0],
        [0, 1],
        [0, 0],
        [1, 0],
        [0, 0]],

       [[0, 1],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0]]])

onehot2 requires you to mark the dimension you want to add as a singleton:

>>> np.random.seed(0x111)
>>> y = np.random.randint(5, size=(3, 1, 2, 1))
>>> y
array([[[[2],
         [3]]],
       [[[3],
         [1]]],
       [[[4],
         [0]]]])

>>> c = onehot2(y, axis=-1, dtype=int)
>>> c.shape
(3, 1, 2, 5)
>>> c
array([[[[0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0]]],

       [[[0, 0, 0, 1, 0],
         [0, 1, 0, 0, 0]]],

       [[[0, 0, 0, 0, 1],
         [1, 0, 0, 0, 0]]]])

>>> d = onehot2(y, axis=-2, dtype=int)
ValueError: Dimension at -2 is non-singleton

>>> e = onehot2(y, dtype=int)
>>> e.shape
(3, 5, 2, 1)
>>> e.squeeze()
array([[[0, 0],
        [0, 0],
        [1, 0],
        [0, 1],
        [0, 0]],

       [[0, 0],
        [0, 1],
        [0, 0],
        [1, 0],
        [0, 0]],

       [[0, 1],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0]]])
Mad Physicist
  • Quite interesting to see `np.indices` being used, I need to get more experience with fancy indexing – RichieV Sep 11 '20 at 05:05
  • @RichieV. I've reverted your edit. The indexing in `onehot` is done that way on purpose. It's meant to operate on `a.squeeze` rather than `a` in the question. But you were right about the bug :) – Mad Physicist Sep 11 '20 at 13:32
  • @RichieV. I've added some examples to show how the two functions are used, in keeping with the spirit of your test. – Mad Physicist Sep 11 '20 at 14:59
  • thanks for code. This worked out awesome and is super fast compared to some of the other answers. – PDPDPDPD Sep 11 '20 at 20:31
  • @PDPDPDPD. RichieV's answer is pretty similar. I would benchmark it against mine if speed matters. Raveling and unraveling are very cheap since they don't copy memory around. – Mad Physicist Sep 11 '20 at 21:34
  • Your code worked better, I believe. I didn't do extensive testing, but it allowed me to eliminate an expand-dimensions call that I had before this. – PDPDPDPD Sep 12 '20 at 04:56

Scikit-learn has an encoder:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Data
values = np.array([1, 3, 2, 4, 1, 2, 1, 3, 5])
val_reshape = values.reshape(len(values), 1)

# One-hot encoding
oh = OneHotEncoder(sparse=False)  # in scikit-learn >= 1.2, use sparse_output=False
oh_arr = oh.fit_transform(val_reshape)

print(oh_arr)

Output
[[1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]]

LRRR