One Hot Encoding using numpy

Question

If the input is zero I want to make an array which looks like this:

[1,0,0,0,0,0,0,0,0,0]

and if the input is 5:

[0,0,0,0,0,1,0,0,0,0]

For the above I wrote:

np.put(np.zeros(10),5,1)

but it did not work.

Is there any way this can be implemented in one line?

Why do you want to do this in one line? If you want to keep it compact, just write a function. — PM 2Ring, Jul 26 '16 at 14:30
It is customary to select one of the answers when you have been provided with at least one that solves your problem. — Mad Physicist, Jul 27 '16 at 15:12

Martin Thoma · Accepted Answer · 2018-08-10T05:21:44.627

120

Usually, when you want to get a one-hot encoding for classification in machine learning, you have an array of indices.

import numpy as np
nb_classes = 6
targets = np.array([[2, 3, 4, 0]]).reshape(-1)
one_hot_targets = np.eye(nb_classes)[targets]

one_hot_targets is now

array([[ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.]])

The .reshape(-1) is there to make sure the labels are in the right format, a flat 1-D array (you might also have [[2], [3], [4], [0]]). The -1 is a special value which means "infer the size of this dimension from the number of elements". As it is the only dimension, this flattens the array.
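A minimal sketch of what `.reshape(-1)` does to a column-shaped label array, and of the row lookup that follows it. The label values are illustrative only.

```python
import numpy as np

nb_classes = 6
# Labels stored as a column vector of shape (4, 1); values are illustrative.
column_targets = np.array([[2], [3], [4], [0]])

flat_targets = column_targets.reshape(-1)            # shape (4,): one label per sample
one_hot_targets = np.eye(nb_classes)[flat_targets]   # one identity row per label

print(flat_targets.shape)       # (4,)
print(one_hot_targets.shape)    # (4, 6)
```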

Copy-Paste solution

def get_one_hot(targets, nb_classes):
    targets = np.asarray(targets)  # accept plain lists as well as arrays
    res = np.eye(nb_classes)[targets.reshape(-1)]
    return res.reshape(list(targets.shape) + [nb_classes])
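As a usage sketch, the function can also be called on a 2-D grid of labels. The definition is repeated here, with an `np.asarray` conversion so that plain lists are accepted, only to keep the snippet self-contained; the label grid is made up for illustration.

```python
import numpy as np

def get_one_hot(targets, nb_classes):
    targets = np.asarray(targets)  # conversion so that lists work too
    res = np.eye(nb_classes)[targets.reshape(-1)]
    return res.reshape(list(targets.shape) + [nb_classes])

# Hypothetical 2x3 grid of class labels (e.g. a tiny segmentation mask).
label_grid = [[0, 2, 1],
              [1, 1, 0]]
encoded = get_one_hot(label_grid, nb_classes=3)

print(encoded.shape)    # (2, 3, 3)
print(encoded[0, 1])    # [0. 0. 1.]
```

The output keeps the shape of the input and appends one axis of length nb_classes.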

Package

You can use mpu.ml.indices2one_hot. It's tested and simple to use:

import mpu.ml
one_hot = mpu.ml.indices2one_hot([1, 3, 0], nb_classes=5)

edited Aug 10 '18 at 05:21

answered Mar 18 '17 at 13:01

Martin Thoma

But how does it work? `np.eye(nb_classes)` should be a 6x6 matrix, but its shape changed to 4x6. Can you elaborate on this? – mrgloom Aug 08 '17 at 09:26
`np.eye(nb_classes)` is a 6x6 matrix. Then I select the rows specified in target. I only select four, so it is a 4x6 matrix. – Martin Thoma Aug 08 '17 at 09:37
this seems to work only for 2-dim targets, but could be generalized for further shapes by executing `.reshape(list(targets.shape)+[nb_classes])` – siddhadev Jun 20 '18 at 15:28
Could you explain why `np.eye(nb_classes)[np.array(targets).reshape(-1)]` works? It's a CxC matrix indexed by a H*W matrix?! What is going on here? – gebbissimo Mar 03 '21 at 12:24
@gebbissimo First, try to understand what `np.eye(n)` does. Then try `np.eye(5)[[3, 1]]`. – Martin Thoma Mar 03 '21 at 13:32
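To make the hint concrete: indexing a 2-D array with a list or array of integers selects whole rows, one per index, and the result takes the shape of the index array plus the row length. A small sketch, with arbitrary index values:

```python
import numpy as np

identity = np.eye(5)          # row k is the one-hot vector for class k

rows = identity[[3, 1]]       # picks row 3, then row 1
print(rows)
# [[0. 0. 0. 1. 0.]
#  [0. 1. 0. 0. 0.]]

# A 2-D index array of shape (H, W) yields a result of shape (H, W, C).
index_grid = np.array([[0, 4],
                       [2, 2]])
print(identity[index_grid].shape)   # (2, 2, 5)
```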

HolyDanna · Answer 2 · 2016-07-26T14:26:35.897

10

Something like:

np.array([int(i == 5) for i in range(10)])

should do the trick. But I suppose there are other solutions using numpy.

Edit: your expression does not work because np.put does not return anything; it modifies the array passed as its first argument in place. The correct way to use np.put() is:

a = np.zeros(10)
np.put(a,5,1)

The catch is that np.put cannot do this in one line, because you need to define the array before passing it to np.put().
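A short sketch of why the original expression yields nothing useful, followed by two one-line alternatives that avoid np.put altogether. The variable names are illustrative.

```python
import numpy as np

# np.put works in place and returns None, so the original expression
# throws away the array it just modified.
result = np.put(np.zeros(10), 5, 1)
print(result)        # None

# One-line alternatives that do return the array:
via_eye = np.eye(10, dtype=int)[5]
via_compare = (np.arange(10) == 5).astype(int)
print(via_eye)       # [0 0 0 0 0 1 0 0 0 0]
print(via_compare)   # [0 0 0 0 0 1 0 0 0 0]
```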

edited Jul 26 '16 at 14:26

answered Jul 26 '16 at 14:19

HolyDanna

@AbhijayGhildyal: That's just about the most _inefficient_ way to accomplish what you want. – PM 2Ring Jul 26 '16 at 14:41
@PM2Ring I know the one-liner I wrote is bad, but do you have any source explaining what you should and should not do with lists and numpy arrays? – HolyDanna Jul 26 '16 at 14:45
@HolyDanna: It's a general rule in Python that a Python loop runs slower than one that executes using C code. So if there's an obvious way to use a C loop instead of a Python one you should use the C loop. And the whole point of using Numpy is to do array processing at C speed, when possible. I'm not familiar with the numpy source code, but `numpy.zeros` probably runs even faster than a C `for` loop, since the CPU can fill a block of memory with a single value _very_ quickly. – PM 2Ring Jul 26 '16 at 14:56
BTW, I'm _not_ saying that your 1st code example is bad. In a non-Numpy program it would be a _good_ way to do this, and it'd be silly to import Numpy just for this operation. But if the program's already using Numpy anyway it makes sense to take advantage of what Numpy has to offer. – PM 2Ring Jul 26 '16 at 14:57

score 5 · Answer 3 · answered Nov 17 '17 at 14:49

5

You could use a list comprehension:

[0 if i != 5 else 1 for i in range(10)]

which evaluates to

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

answered Nov 17 '17 at 14:49

Rikku Porta

score 4 · Answer 4 · answered Jun 05 '17 at 03:45

4

I'm not sure about the performance, but the following code works and is neat.

x = np.array([0, 5])
x_onehot = np.identity(6)[x]

answered Jun 05 '17 at 03:45

Ken Chan

that is basically equivalent to the accepted answer. thank you for answering it again. – Nik O'Lai Aug 03 '21 at 16:48

score 3 · Answer 5 · edited Mar 30 '19 at 03:02

3

Use np.identity or np.eye. You can try something like this with your input i and the array size s:

np.identity(s)[i:i+1]

For example, print(np.identity(10)[0:1]) prints:

[[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
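Note that the slice `i:i+1` keeps the result two-dimensional, whereas a plain integer index returns a 1-D row. A short sketch of the difference, with `s` and `i` chosen arbitrarily:

```python
import numpy as np

s, i = 10, 5   # array size and hot position, chosen for illustration

row_2d = np.identity(s)[i:i+1]   # slicing keeps the leading axis
row_1d = np.identity(s)[i]       # integer indexing drops it

print(row_2d.shape)   # (1, 10)
print(row_1d.shape)   # (10,)
```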

If you are using TensorFlow, you can use tf.one_hot: https://www.tensorflow.org/api_docs/python/array_ops/slicing_and_joining#one_hot

edited Mar 30 '19 at 03:02

Mark Adler

answered Feb 07 '17 at 05:24

Sung Kim

m00am · Answer 6 · 2016-07-26T14:36:48.600

The problem here is that you never store your array anywhere. The put function works in place on the array and returns nothing. Since you never give your array a name, you cannot refer to it later. So this

one_pos = 5
x = np.zeros(10)
np.put(x, one_pos, 1)

would work, but then you could just use indexing:

one_pos = 5
x = np.zeros(10)
x[one_pos] = 1

In my opinion that would be the correct way to do this if there is no special reason to do it as a one-liner. It is also likely to be easier to read, and readable code is good code.

score 2 · Answer 7 · answered Jul 26 '16 at 14:27

Taking a quick look at the manual, you will see that np.put does not return a value. While your technique is fine, you are accessing None instead of your result array.

For a 1-D array it is better to just use direct indexing, especially for such a simple case.

Here is how to rewrite your code with minimal modification:

arr = np.zeros(10)
np.put(arr, 5, 1)

Here is how to do the second line with indexing instead of put:

arr[5] = 1

PM 2Ring · Answer 8 · 2016-07-26T15:01:50.407

2

np.put mutates its array argument in place. It's conventional in Python for functions and methods that perform in-place mutation to return None; np.put adheres to that convention. So if a is a 1D array and you do

a = np.put(a, 5, 1)

then a will get replaced by None.

Your code is similar to that, but it passes an un-named array to np.put.

A compact and efficient way to do what you want is with a simple function, e.g.:

import numpy as np

def one_hot(i):
    a = np.zeros(10, 'uint8')
    a[i] = 1
    return a

a = one_hot(5) 
print(a)

output

[0 0 0 0 0 1 0 0 0 0]

edited Jul 26 '16 at 15:01

answered Jul 26 '16 at 14:47

PM 2Ring

I'll take note of that, so as not to be rude to people – HolyDanna Jul 26 '16 at 15:06

score 0 · Answer 9 · answered Jul 26 '16 at 15:02

0

import time
import numpy as np

# Approach 1: np.put on a fresh array for each label
start_time = time.time()
z = []
for l in [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 4, 6]:
    a = np.repeat(0, 10)
    np.put(a, l, 1)
    z.append(a)
print("--- %s seconds ---" % (time.time() - start_time))

#--- 0.00174784660339 seconds ---

# Approach 2: list comprehension converted to an array
start_time = time.time()
z = []
for l in [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 4, 6]:
    z.append(np.array([int(i == l) for i in range(10)]))
print("--- %s seconds ---" % (time.time() - start_time))

#--- 0.000400066375732 seconds ---

answered Jul 26 '16 at 15:02

Abhijay Ghildyal

Using `a = np.zeros(10)`, the first version comes out slightly faster for me: `0.0007712841033935547 seconds` against `0.0008835792541503906 seconds` for the second version – HolyDanna Jul 26 '16 at 15:05
Try `a = np.zeros(10); a[l] = 1`; indexed assignment is faster than doing a function call. My `one_hot` function is a little slower than this inline version, also due to the overhead of the function call, but it's faster than the other techniques. However, this timing info is not very accurate. You should use the `timeit` module, and use its facilities to perform your tests hundreds (or thousands) of times to get meaningful results that aren't swamped by the "noise" of other tasks your CPU is performing. – PM 2Ring Jul 26 '16 at 15:18
Thanks. Do you know of any better ways to check code run times? – Abhijay Ghildyal Jul 26 '16 at 15:23
As I said, use the [timeit](https://docs.python.org/3/library/timeit.html) module. FWIW, here are a couple of my recent answers that use `timeit` http://stackoverflow.com/a/38075792/4014959 and http://stackoverflow.com/a/36030019/4014959 – PM 2Ring Jul 26 '16 at 15:33
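A minimal sketch of how the approaches from this thread could be compared with `timeit`. The function names are made up for illustration, and no timings are shown because they depend on the machine and the NumPy version.

```python
import timeit
import numpy as np

def with_put(i, n=10):
    a = np.zeros(n)
    np.put(a, i, 1)
    return a

def with_index(i, n=10):
    a = np.zeros(n)
    a[i] = 1
    return a

def with_comprehension(i, n=10):
    return np.array([int(k == i) for k in range(n)])

for func in (with_put, with_index, with_comprehension):
    # repeat() returns one total per run; the minimum is the least noisy.
    best = min(timeit.repeat(lambda: func(5), number=10_000, repeat=5))
    print(f"{func.__name__}: {best:.4f} s per 10,000 calls")
```

Taking the minimum over several repeats is the usual way to reduce the influence of other processes on the measurement.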
