Convert array of indices to one-hot encoded array in NumPy

Question

Given a 1D array of indices:

a = array([1, 0, 3])

I want to one-hot encode this as a 2D array:

b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])

score 535 · Accepted Answer · edited Aug 06 '22 at 21:24


Create a zeroed array b with enough columns, i.e. a.max() + 1.
Then, for each row i, set the a[i]th column to 1.

>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1

>>> b
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

edited Aug 06 '22 at 21:24

Mateen Ulhaq


answered Apr 23 '15 at 18:30

YXD


14

@JamesAtwood it depends on the application but I'd make the max a parameter and not calculate it from the data. – Mohammad Moghimi Feb 08 '16 at 20:40
8

what if 'a' was 2d? and you want a 3-d one-hot matrix? – A.D Oct 18 '17 at 22:39
14

Can anyone point to an explanation of why this works, but the slice with [:, a] does not? – N. McA. Feb 16 '18 at 19:40
4

@ A.D. Solution for the 2d -> 3d case: https://stackoverflow.com/questions/36960320/convert-a-2d-matrix-to-a-3d-one-hot-matrix-numpy – cgnorthcutt Sep 29 '18 at 02:37
You can also use scipy.sparse. – mathtick Apr 08 '19 at 20:17
@N.McA. I had the same question, and found the part of the documentation where multi-dimensional indexing with arrays is explained: https://numpy.org/doc/stable/user/basics.indexing.html#indexing-multi-dimensional-arrays – Marcos Pereira Aug 19 '20 at 10:08
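
A minimal sketch of the difference raised in the comments above, assuming the same `a` as in the answer. The names `paired` and `sliced` are illustrative only. With two index arrays, NumPy pairs them element by element; with a slice in the first position, every row is selected.

```python
import numpy as np

a = np.array([1, 0, 3])
n_rows, n_cols = a.size, a.max() + 1

# Paired integer indexing: the two index arrays are matched element by element,
# so exactly the cells (0, 1), (1, 0) and (2, 3) are set.
paired = np.zeros((n_rows, n_cols), dtype=int)
paired[np.arange(n_rows), a] = 1

# Slice plus index array: every row is selected, so columns 1, 0 and 3
# are set to 1 in all rows.
sliced = np.zeros((n_rows, n_cols), dtype=int)
sliced[:, a] = 1

print(paired)
print(sliced)
```

Here `paired` is the one-hot result, while every row of `sliced` comes out as `[1, 1, 0, 1]`.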

K3---rnc · Answer 2 · 2019-04-08T22:15:32.913

268

>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

edited Apr 08 '19 at 22:15

answered May 19 '16 at 12:35

K3---rnc


15

This solution is the only one useful for an input N-D matrix to one-hot N+1D matrix. Example: input_matrix=np.asarray([[0,1,1] , [1,1,2]]) ; np.eye(3)[input_matrix] # output 3D tensor – Isaías Mar 21 '17 at 16:06
10

+1 because this should be preferred over the accepted solution. For a more general solution though, `values` should be a Numpy array rather than a Python list, then it works in all dimensions, not only in 1D. – Alex Oct 21 '17 at 20:32
14

Note that taking `np.max(values) + 1` as number of buckets might not be desirable if your data set is say randomly sampled and just by chance it may not contain max value. Number of buckets should be rather a parameter and assertion/check can be in place to check that each value is within 0 (incl) and buckets count (excl). – NightElfik Jan 19 '18 at 03:46
3

To me this solution is the best and can be easily generalized to any tensor: def one_hot(x, depth=10): return np.eye(depth)[x]. Note that giving the tensor x as index returns a tensor of x.shape eye rows. – cecconeurale Mar 27 '18 at 07:37
9

Easy way to "understand" this solution and why it works for N-dims (without reading `numpy` docs): at each location in the original matrix (`values`), we have an integer `k`, and we "put" the 1-hot vector `eye(n)[k]` in that location. This adds a dimension because we're "putting" a vector in the location of a scalar in the original matrix. – avivr Sep 24 '19 at 14:08
2

For those wondering, benchmarks show that this code is just slightly slower than the accepted answer (https://stackoverflow.com/a/29831596/8729073). – Alexandre Huat Oct 20 '20 at 11:33
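
The N-D behaviour described in the comments above can be sketched as follows. The wrapper name `one_hot` and the explicit `num_classes` parameter are assumptions for illustration, not part of the answer.

```python
import numpy as np

def one_hot(values, num_classes):
    """One-hot encode integer labels of any shape; num_classes is passed explicitly."""
    values = np.asarray(values)
    # Row k of the identity matrix is the one-hot vector for class k,
    # so indexing with an N-D integer array yields an (N+1)-D result.
    return np.eye(num_classes, dtype=int)[values]

labels_2d = np.array([[0, 1, 1], [1, 1, 2]])
encoded = one_hot(labels_2d, num_classes=3)
print(encoded.shape)   # (2, 3, 3)
print(encoded[1, 2])   # [0 0 1], the one-hot vector for class 2
```

Converting with `np.asarray` first means a nested Python list behaves the same way as an array.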

score 57 · Answer 3 · edited Jun 15 '18 at 15:57


In case you are using keras, there is a built in utility for that:

from keras.utils import to_categorical

categorical_labels = to_categorical(int_labels, num_classes=3)

And it does pretty much the same as @YXD's answer (see source-code).

edited Jun 15 '18 at 15:57

Berriel


answered Nov 27 '17 at 11:13

Jodo


score 53 · Answer 4 · edited Aug 20 '18 at 11:10


Here is what I find useful:

def one_hot(a, num_classes):
  return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

Here num_classes is the number of classes you have. So a vector of shape (10000,) is transformed to shape (10000, C), where C = num_classes. Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].

Exactly what you wanted to have I believe.

PS: the source is Sequence models - deeplearning.ai

edited Aug 20 '18 at 11:10

Augustin


answered Mar 11 '18 at 07:41

D.Samchuk


2

Also, what is the reason for calling np.squeeze()? `np.eye(num_classes)[a.reshape(-1)]` already gives one one-hot row per element of a: `np.eye` creates an identity matrix whose row k has a 1 at class index k and zeros elsewhere, and indexing it with `a.reshape(-1)` picks the row for each index. I don't understand the need for `np.squeeze`, since it only removes size-1 dimensions, which we will never have: the output's shape will always be `(a_flattened_size, num_classes)`. – Anu Mar 14 '19 at 05:07

score 46 · Answer 5 · edited Jul 05 '19 at 13:52


You can also use eye function of numpy:

numpy.eye(num_classes)[labels]  # num_classes: number of classes; labels: integer array containing the labels

edited Jul 05 '19 at 13:52

Rishabh Agrahari


answered Apr 12 '18 at 07:14

Karma


13

For more clarity using `np.identity(num_classes)[indices]` might be better. Nice answer! – Oliver Sep 02 '19 at 11:13
1

That's the only absolutely pythonic answer in all its brevity. – Maksym Ganenko Jun 07 '21 at 09:59
3

This has repeated the answer of K3---rnc two years later, and nobody seems to see it. – questionto42 Jul 16 '21 at 00:15
Also consider reshaping the vector containing the labels: `numpy.eye(num_class)[labels.reshape(-1)]`. For example, if the labels have shape (x, 1), this avoids producing an output of shape (x, 1, num_class). – Péter Szilvási Jul 22 '22 at 12:53
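
A short sketch of the shape effect of flattening column-vector labels before indexing. The variable names `direct` and `flattened` are illustrative only.

```python
import numpy as np

labels = np.array([[2], [0], [1], [2]])   # column vector, shape (4, 1)
num_classes = 3

direct = np.eye(num_classes)[labels]                  # keeps the size-1 axis
flattened = np.eye(num_classes)[labels.reshape(-1)]   # one row per label

print(direct.shape)     # (4, 1, 3)
print(flattened.shape)  # (4, 3)
```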

score 31 · Answer 6 · answered Feb 16 '17 at 02:15

You can use sklearn.preprocessing.LabelBinarizer:

Example:

import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))

output:

[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]

Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer(sparse_output=True) so that the output of transform is sparse.

Shubham Mishra · Answer 7 · 2020-04-10T23:42:09.967

8

For 1-hot-encoding

   one_hot_encode=pandas.get_dummies(array)

For Example

ENJOY CODING

edited Apr 10 '20 at 23:42

answered Apr 10 '20 at 23:27

Shubham Mishra


1

Thanks for the comment, but a brief description of what the code is doing would be very helpful! – Clarus Apr 10 '20 at 23:33
please refer the example – Shubham Mishra Apr 10 '20 at 23:47
@Clarus Checkout the below example. You can access the one hot encoding of each value in your np array by doing a one_hot_encode[value].
>>> import numpy as np
>>> import pandas
>>> a = np.array([1,0,3])
>>> one_hot_encode=pandas.get_dummies(a)
>>> print(one_hot_encode)
   0  1  3
0  0  1  0
1  1  0  0
2  0  0  1
>>> print(one_hot_encode[1])
0    1
1    0
2    0
Name: 1, dtype: uint8
>>> print(one_hot_encode[0])
0    0
1    1
2    0
Name: 0, dtype: uint8
>>> print(one_hot_encode[3])
0    0
1    0
2    1
Name: 3, dtype: uint8
– Deepak Apr 11 '20 at 04:20
Not the ideal tool – PigSpider Feb 16 '22 at 09:50
welcome to stackoverflow. Generally it's preferred to make the answers self-contained, i.e. copy the example into your answer, rather than just linking to it. – Hugh Perkins Aug 18 '22 at 00:40

score 6 · Answer 8 · answered Sep 14 '16 at 00:02

Here is a function that converts a 1-D vector to a 2-D one-hot array.

#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """

    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector)+1
    else:
        assert num_classes > 0
        assert num_classes > np.max(vector)  # largest index must be < num_classes

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)

Below is some example usage:

>>> a = np.array([1, 0, 3])

>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])

>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

Note that this only works on vectors (and there is no `assert` to check vector shape ;) ). — johndodo, May 12 '17 at 21:01
+1 for the generalized approach and parameters check. However, as a common practice, I suggest to NOT use asserts to perform checks on inputs. Use asserts only to verify internal intermediate conditions. Rather, convert all `assert ___` into `if not ___ raise Exception()`. — fnunnari, Sep 23 '19 at 08:19
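
One way the suggestions in the comments above could look: a sketch with explicit exceptions instead of asserts, plus a shape check. The function name `to_one_hot`, the exception types, and the messages are assumptions, not the answer author's code.

```python
import numpy as np

def to_one_hot(vector, num_classes=None):
    """One-hot encode a 1-D integer array, raising on invalid input."""
    vector = np.asarray(vector)
    if vector.ndim != 1 or vector.size == 0:
        raise ValueError("vector must be a non-empty 1-D array")
    if not np.issubdtype(vector.dtype, np.integer):
        raise TypeError("vector must contain integers")
    if num_classes is None:
        num_classes = int(vector.max()) + 1
    # Every label must be a valid column index: 0 <= label < num_classes.
    if vector.min() < 0 or vector.max() >= num_classes:
        raise ValueError("labels must lie in [0, num_classes)")
    result = np.zeros((vector.size, num_classes), dtype=int)
    result[np.arange(vector.size), vector] = 1
    return result

print(to_one_hot(np.array([1, 0, 3])))
```

Unlike asserts, these checks still run when Python is started with `-O`.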

score 5 · Answer 9 · answered May 26 '19 at 15:29


You can use the following code for converting into a one-hot vector:

Let x be the usual class vector, a single column holding classes 0 to some number:

import numpy as np
np.eye(x.max()+1)[x]

If 0 is not a class (labels start at 1), remove the +1 and shift the indices instead: np.eye(x.max())[x - 1].

answered May 26 '19 at 15:29

Inaam Ilahi


3

This repeats the answer of K3---rnc three years later. – questionto42 Jul 16 '21 at 00:17

score 2 · Answer 10 · answered Oct 11 '16 at 22:26

I think the short answer is no. For a more generic case in n dimensions, I came up with this:

# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
# The index must be a tuple of arrays; recent NumPy versions reject a list here
z[tuple(np.indices(z.shape[:-1])) + (a,)] = 1

I am wondering if there is a better solution -- I don't like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.
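
One possible alternative that avoids building the index lists by hand, sketched with `np.put_along_axis` (available in NumPy 1.15 and later). The name `num_values` is illustrative; whether this is faster has not been measured here.

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 2, 1]])
num_values = 4

z = np.zeros(a.shape + (num_values,))
# a[..., None] has shape (2, 3, 1): each entry names the position
# along the last axis of z that should be set to 1.
np.put_along_axis(z, a[..., None], 1, axis=-1)

print(z.shape)   # (2, 3, 4)
```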

score 2 · Answer 11 · answered Jan 17 '18 at 14:08

Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:

def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]

Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):

def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b

The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:

>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

score 2 · Answer 12 · answered Oct 20 '20 at 11:11

If using tensorflow, there is one_hot():

import tensorflow as tf
import numpy as np

a = np.array([1, 0, 3])
depth = 4
b = tf.one_hot(a, depth)
# <tf.Tensor: shape=(3, 4), dtype=float32, numpy=
# array([[0., 1., 0., 0.],
#        [1., 0., 0., 0.],
#        [0., 0., 0., 1.]], dtype=float32)>

TeeTracker · Answer 13 · 2021-05-09T21:13:13.150

2

def one_hot(n, class_num, col_wise=True):
  a = np.eye(class_num)[n.reshape(-1)]
  return a.T if col_wise else a

# Column for different hot
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10))
# Row for different hot
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10, col_wise=False))

edited May 09 '21 at 21:13

answered May 09 '21 at 20:53

TeeTracker


Jon Deaton · Answer 14 · 2022-02-03T00:12:18.547

2

I find the easiest solution combines np.take and np.eye

def one_hot(x, depth: int):
  return np.take(np.eye(depth), x, axis=0)

works for x of any shape.

edited Feb 03 '22 at 00:12

answered Feb 03 '22 at 00:05

Jon Deaton


score 1 · Answer 15 · answered Jan 25 '18 at 13:10

I recently ran into the same kind of problem and found that the solutions above are only satisfactory if your labels form a contiguous range. For example, if you want to one-hot encode the following list:

all_good_list = [0,1,2,3,4]

go ahead, the solutions posted above work fine. But consider this data:

problematic_list = [0,23,12,89,10]

If you do it with the methods mentioned above, you will likely end up with 90 one-hot columns. This is because all answers include something like n = np.max(a)+1. I found a more generic solution that worked for me and wanted to share it:

import numpy as np
import sklearn.preprocessing
sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1,2,44,3,2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)

I hope this comes in handy for anyone who has run into the same restriction with the solutions above.
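
For readers who prefer to stay within NumPy, the same compact encoding can be sketched with `np.unique(..., return_inverse=True)`. This is an alternative technique, not part of the answer above; the names `classes` and `inverse` are illustrative.

```python
import numpy as np

a = np.asarray([1, 2, 44, 3, 2])

# classes: sorted distinct labels; inverse: position of each label within classes
classes, inverse = np.unique(a, return_inverse=True)
b = np.eye(classes.size, dtype=int)[inverse]

print(classes)   # [ 1  2  3 44]
print(b.shape)   # (5, 4): one column per distinct label, not 45 columns
```

Column j of `b` corresponds to `classes[j]`, so `classes[b.argmax(axis=1)]` maps the encoding back to the original labels.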

eqzx · Answer 16 · 2018-07-30T01:42:42.350

Here's a dimensionality-independent standalone solution.

This will convert any N-dimensional array arr of nonnegative integers to a one-hot N+1-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1)

def expand_integer_grid(arr, n_classes):
    """

    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray

    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    one_hot[tuple(flat_grids) + (arr.ravel(),)] = 1
    assert((one_hot.sum(-1) == 1).all())
    assert(np.allclose(np.argmax(one_hot, -1), arr))
    return one_hot
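
The round-trip property mentioned above (recovering the input with `np.argmax`) can be checked with a short sketch. To keep it self-contained, it uses an identity-matrix lookup as a stand-in encoder rather than `expand_integer_grid`; the random test array is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 5, size=(2, 3, 4))   # 3-D array of class indices in [0, 5)

# Stand-in encoder: row c of the identity matrix is the one-hot vector for class c
one_hot = np.eye(5, dtype=int)[arr]

recovered = np.argmax(one_hot, axis=-1)
print(one_hot.shape)                    # (2, 3, 4, 5)
print(np.array_equal(recovered, arr))   # True
```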

score 1 · Answer 17 · answered Aug 30 '18 at 06:36


Encodings of this type usually start from a NumPy array. If you are using a NumPy array like this:

a = np.array([1,0,3])

then there is a very simple way to convert it to a one-hot encoding:

out = (np.arange(4) == a[:,None]).astype(np.float32)

That's it.
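
A step-by-step sketch of why the one-liner works: the comparison broadcasts a column of labels against a row of class indices. The intermediate names (`column`, `classes`, `mask`) are illustrative only.

```python
import numpy as np

a = np.array([1, 0, 3])
num_classes = 4

column = a[:, None]                 # shape (3, 1)
classes = np.arange(num_classes)    # shape (4,)
mask = classes == column            # broadcasts to shape (3, 4), dtype bool
out = mask.astype(np.float32)

print(mask.shape)   # (3, 4)
print(out)
```

Row i of `mask` is True only in the column equal to `a[i]`, which is exactly the one-hot pattern.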

answered Aug 30 '18 at 06:36

Sudeep K Rana


score 1 · Answer 18 · answered Nov 03 '18 at 10:17

p will be a 2d ndarray.
We want to know which value is the highest in a row, to put there 1 and everywhere else 0.

clean and easy solution:

max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
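
A small worked run of the three lines above, with a made-up probability matrix `p` as the input (the values are illustrative only).

```python
import numpy as np

p = np.array([[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2],
              [0.2, 0.2, 0.6]])

# Column index of the largest value in each row, kept as a (3, 1) array
max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)

print(one_hot)
```

If a row contains ties, `np.argmax` picks the first maximum, so each row still gets exactly one 1.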

score 0 · Answer 19 · answered Jan 06 '18 at 18:12

Here is an example function that I wrote to do this based upon the answers above and my own use case:

def label_vector_to_one_hot_vector(vector, one_hot_size=10):
    """
    Use to convert a column vector to a 'one-hot' matrix

    Example:
        vector: [[2], [0], [1]]
        one_hot_size: 3
        returns:
            [[ 0.,  0.,  1.],
             [ 1.,  0.,  0.],
             [ 0.,  1.,  0.]]

    Parameters:
        vector (np.array): of size (n, 1) to be converted
        one_hot_size (int) optional: size of 'one-hot' row vector

    Returns:
        np.array size (vector.size, one_hot_size): converted to a 'one-hot' matrix
    """
    squeezed_vector = np.squeeze(vector, axis=-1)

    one_hot = np.zeros((squeezed_vector.size, one_hot_size))

    one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1

    return one_hot

label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)

Jordy Van Landeghem · Answer 20 · 2018-06-05T13:50:11.480

I am adding for completion a simple function, using only numpy operators:

    def probs_to_onehot(output_probabilities):
        argmax_indices_array = np.argmax(output_probabilities, axis=1)
        # Width = number of columns; counting unique argmax values undercounts when a class never wins
        n_classes = np.shape(output_probabilities)[1]
        onehot_output_array = np.eye(n_classes)[argmax_indices_array.reshape(-1)]
        return onehot_output_array

It takes as input a probability matrix: e.g.:

[[0.03038822 0.65810204 0.16549407 0.3797123 ] ... [0.02771272 0.2760752 0.3280924 0.33458805]]

And it will return

[[0 1 0 0] ... [0 0 0 1]]
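
A sketch of why the width of the identity matrix is best taken from the number of columns of the probability matrix rather than from the number of distinct argmax values. The matrix `probs` is made up for illustration: classes 1 and 2 never have the highest probability.

```python
import numpy as np

probs = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.1, 0.1, 0.2, 0.6],
                  [0.7, 0.1, 0.1, 0.1]])

winners = np.argmax(probs, axis=1)          # array([3, 3, 0])
n_distinct = np.unique(winners).shape[0]    # 2: too small, index 3 would be out of range
n_columns = probs.shape[1]                  # 4: safe width

onehot = np.eye(n_columns, dtype=int)[winners]
print(onehot)
```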

score -1 · Answer 21 · answered Feb 27 '19 at 18:33


Use the following code. It works best.

def one_hot_encode(x):
    """
    argument
        - x: a list of labels
    return
        - one hot encoding matrix (number of labels, number of class)
    """
    encoded = np.zeros((len(x), 10))

    for idx, val in enumerate(x):
        encoded[idx][val] = 1

    return encoded

Found it here P.S You don't need to go into the link.

answered Feb 27 '19 at 18:33

Inaam Ilahi


9

You should avoid using loops with numpy – Kenan Mar 01 '19 at 02:42
It does not answer the question: "Is there a quick way to do this? Quicker than just looping over a to set elements of b, that is." – Alexandre Huat Jul 06 '20 at 10:00
@AlexandreHuat You can use the numpy function np.eye() – Inaam Ilahi Oct 22 '20 at 09:58
1

Then you should make an answer where you say that one can use `numpy.eye()` (but it was already done by another user). Please, make sure to read questions and already posted answers carefully in order to maintain the quality of stackoverflow and the community. – Alexandre Huat Oct 22 '20 at 15:11

Guillaume Chevalier · Answer 22 · 2020-01-03T15:49:43.700

-1

Using a Neuraxle pipeline step:

Set up your example

import numpy as np
a = np.array([1,0,3])
b = np.array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])

Do the actual conversion

from neuraxle.steps.numpy import OneHotEncoder
encoder = OneHotEncoder(nb_columns=4)
b_pred = encoder.transform(a)

Assert it works

assert np.array_equal(b_pred, b)  # a bare == on arrays is ambiguous inside assert

Link to documentation: neuraxle.steps.numpy.OneHotEncoder

edited Jan 03 '20 at 15:49

answered Dec 10 '19 at 07:39

Guillaume Chevalier

