Normalize numpy array columns in python

Question

I have a numpy array where each cell of a specific row represents a value for a feature. I store all of them in a 100*4 matrix.

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

Any idea how I can normalize this numpy.array column by column so that each value is between 0 and 1?

My desired output is:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18 (which is 0.09/0.5)

Just to be clear: is it a NumPy array or a Pandas DataFrame? — Alex Riley, Apr 15 '15 at 21:52
When programming it's important to be specific: a `set` is a particular object in Python, and you can't have a set of numpy arrays. Python doesn't have a matrix, but numpy does, and that `matrix` type isn't the same as a numpy `array/ndarray` (which is itself different from Python's `array` type, which is not the same as a `list`). And none of these are pandas `DataFrame`s. — DSM, Apr 15 '15 at 21:58
I do not think this is a complete normalization. I would look at http://stackoverflow.com/questions/9775765/normalize-standardize-a-numpy-recarray for a better definition of normalization. — 1man, Jan 15 '17 at 01:36

ali_m · Accepted Answer · 2016-01-29T22:51:48.150

122

If I understand correctly, what you want to do is divide by the maximum value in each column. You can do this easily using broadcasting.

Starting with your example array:

import numpy as np

x = np.array([[1000,  10,   0.5],
              [ 765,   5,  0.35],
              [ 800,   7,  0.09]])

x_normed = x / x.max(axis=0)

print(x_normed)
# [[ 1.     1.     1.   ]
#  [ 0.765  0.5    0.7  ]
#  [ 0.8    0.7    0.18 ]]

x.max(0) takes the maximum over the 0th dimension (i.e. rows). This gives you a vector of size (ncols,) containing the maximum value in each column. You can then divide x by this vector in order to normalize your values such that the maximum value in each column will be scaled to 1.
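To make the broadcasting step concrete, it can help to look at the shapes involved. This is a small sketch on the same example array; the name `col_max` is only for illustration.

```python
import numpy as np

x = np.array([[1000, 10, 0.5],
              [ 765,  5, 0.35],
              [ 800,  7, 0.09]])

col_max = x.max(axis=0)   # one maximum per column
print(x.shape)            # (3, 3)
print(col_max.shape)      # (3,)

# Broadcasting stretches col_max across every row of x, so each
# element is divided by the maximum of its own column.
x_normed = x / col_max
print(x_normed.max(axis=0))
```

After the division, the largest entry in every column is 1.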

If x contains negative values you would need to subtract the minimum first:

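x_normed = (x - x.min(0)) / np.ptp(x, axis=0)

Here, np.ptp(x, axis=0) returns the "peak-to-peak" (i.e. the range, max - min) along axis 0. (The method form x.ptp(0) was removed in NumPy 2.0, so the function form is used here.) This normalization also guarantees that the minimum value in each column will be 0.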
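A minimal sketch of the min-max variant on an array that does contain negative values. The array `y` is made-up illustration data, not from the question, and the function form `np.ptp` is used so the snippet does not depend on the NumPy version.

```python
import numpy as np

# Hypothetical data with negative entries (not from the question)
y = np.array([[-2.0, 10.0],
              [ 0.0, 20.0],
              [ 2.0, 40.0]])

col_min = y.min(axis=0)          # minimum of each column
col_range = np.ptp(y, axis=0)    # max - min of each column

# Shift each column so its minimum is 0, then scale its range to 1
y_normed = (y - col_min) / col_range
print(y_normed)
```

Each column of `y_normed` now runs from 0 to 1. Note that a constant column has a range of 0, so this sketch would divide by zero there; such columns would need separate handling.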

edited Jan 29 '16 at 22:51

answered Apr 15 '15 at 22:02


I really appreciate your answer, I always have issues dealing with "axis" ! – ahajib Apr 16 '15 at 05:39

For reductions (i.e. `.max()`, `.min()`, `.sum()`, `.mean()` etc.), you just need to remember that `axis` specifies the dimension that you want to "collapse" during the reduction. If you want the maximum for each column, then you need to collapse the row dimension. – ali_m Apr 16 '15 at 09:41
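The "collapse" rule is easy to check on a non-square array, where the two axes cannot be confused. A short sketch; the array `a` is just illustration data.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns

print(a.max(axis=0).shape)  # (4,)  rows collapsed: one value per column
print(a.max(axis=1).shape)  # (3,)  columns collapsed: one value per row
print(a.sum(axis=0))        # [12 15 18 21]
print(a.sum(axis=1))        # [ 6 22 38]
```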

@rawbeans See my update. The reason I divided by the maximum is because that's what the OP showed in their example. – ali_m Jan 29 '16 at 22:50
@ali_m, Would you please explain why you are saying "If x contains negative values"? If the minimum of the array is 100 and the maximum is 103, I think you should definitely use your second formula, otherwise your result will not have a 0 offset. – 1man Jan 15 '17 at 01:33
@1man Simply because that's exactly what the OP showed in their example. *"Between 0 and 1"* is ambiguous - it doesn't necessarily imply that the minimum value must be zero, only that all of the values must be >= 0 and <= 1. – ali_m Jan 15 '17 at 12:05
Note that this broadcast notation won't work if collapsing any axis other than 0; if you want to normalize the rows, you'll need to be more explicit in the division step, or it will normalize the columns by the sums of the rows! – Galactic Ketchup Apr 27 '20 at 16:44

@GalacticKetchup You can easily extend this to reductions over arbitrary axes by passing `keepdims=True` to the reduction ufunc. This arg prevents the reduction axis from getting "squeezed out" so that broadcasting will still work correctly, e.g. `x / x.max(axis=1, keepdims=True)`. – ali_m Apr 27 '20 at 17:36
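A sketch of the `keepdims=True` approach applied to rows, using the example array from the question. The names `row_max` and `x_rows` are only for illustration.

```python
import numpy as np

x = np.array([[1000, 10, 0.5],
              [ 765,  5, 0.35],
              [ 800,  7, 0.09]])

# keepdims=True keeps the reduced axis with length 1, giving a column vector
row_max = x.max(axis=1, keepdims=True)
print(row_max.shape)        # (3, 1)

# The (3, 1) array broadcasts across the columns of each row
x_rows = x / row_max
print(x_rows.max(axis=1))   # [1. 1. 1.]
```

Without `keepdims=True` the row maxima have shape `(3,)`, which broadcasts against the last axis. For this square array that silently divides each column by a row maximum, and for a non-square array it raises a shape error.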

score 31 · Answer 2 · answered May 30 '17 at 08:45

You can use sklearn.preprocessing:

import numpy as np
from sklearn.preprocessing import normalize

data = np.array([
    [1000, 10, 0.5],
    [765, 5, 0.35],
    [800, 7, 0.09],
])

# axis=0 normalizes each column; norm='max' divides by the column maximum
data = normalize(data, axis=0, norm='max')
print(data)
# [[ 1.     1.     1.   ]
#  [ 0.765  0.5    0.7  ]
#  [ 0.8    0.7    0.18 ]]

Marcin Mrugas


Any way to scale the column values between `1` and `2`? Using MinMaxScaler? – Robur_131 Oct 12 '20 at 10:16
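One way to get the range asked about in this comment is to min-max scale each column to [0, 1] and then stretch and shift it. This is a minimal NumPy sketch, with `lo` and `hi` as illustrative names; scikit-learn's `MinMaxScaler` also accepts a `feature_range` argument (for example `feature_range=(1, 2)`), which should give the same per-column result.

```python
import numpy as np

x = np.array([[1000, 10, 0.5],
              [ 765,  5, 0.35],
              [ 800,  7, 0.09]])

lo, hi = 1.0, 2.0   # target range (illustrative names)

# Scale each column to [0, 1], then map that interval onto [lo, hi]
x01 = (x - x.min(axis=0)) / np.ptp(x, axis=0)
x_scaled = x01 * (hi - lo) + lo

print(x_scaled.min(axis=0))
print(x_scaled.max(axis=0))
```

Every column of `x_scaled` then has minimum 1 and maximum 2.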
