
I'm supposed to normalize an array. I've read about normalization and come across a formula:

[image: the min-max normalization formula, x' = (x - min(x)) / (max(x) - min(x))]

I wrote the following function for it:

def normalize_list(list):
    max_value = max(list)
    min_value = min(list)
    for i in range(0, len(list)):
        list[i] = (list[i] - min_value) / (max_value - min_value)

That is supposed to normalize an array of elements.

Then I came across this answer: https://stackoverflow.com/a/21031303/6209399, which says you can normalize an array by simply doing this:

import numpy as np

def normalize_list_numpy(list):
    normalized_list = list / np.linalg.norm(list)
    return normalized_list

If I normalize this test array test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9] with my own function and with the numpy method, I get these answers:

My own function: [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
The numpy way: [0.059234887775909233, 0.11846977555181847, 0.17770466332772769, 0.23693955110363693, 0.29617443887954614, 0.35540932665545538, 0.41464421443136462, 0.47387910220727386, 0.5331139899831830…]

Why do the functions give different answers? Are there other ways to normalize an array of data? What does numpy.linalg.norm(list) do? What am I getting wrong?

asked by OuuGiii, edited by rzaaeeff
  • Just so you're aware, this isn't the traditional formula for normalization, which is usually expressed as (x - x_mean) / stdev(x), which standardizes x to be normally distributed. (stdev is standard deviation.) – Brad Solomon Oct 24 '17 at 16:25
  • Agree with Brad. Your formula scales the values to the interval [0, 1], while "normalization" more often means transforming to have mean 0 and variance 1 (in statistics), or scaling a vector to have unit length with respect to some norm (usually L2). – phipsgabler Oct 24 '17 at 16:30
  • Isn't that called 'Standardization'? @phg – OuuGiii Oct 24 '17 at 16:37
  • @OuuGiii No, without having an official reference to cite I would say that both "normalization" and "standardization" refer to subtracting out a mean and dividing by a standard deviation to get the data to have an N~(0,1) distribution. Maybe normalization could take on the meaning you mention in linear algebra contexts, but I would say phg's is the dominant usage. – Brad Solomon Oct 24 '17 at 16:39
  • I've tried the way you said, @BradSolomon, with "(x - x_mean) / stdev(x)"; it still doesn't give the same answer as the numpy way to normalize a list. What does the numpy way do? – OuuGiii Oct 24 '17 at 16:45
  • `normalize_list_numpy` as you have it defined is something completely different from the type of scaling that I'm talking about and that @utengr mentions also. This isn't "the NumPy way", it's just NumPy's way of implementing that specific definition of scaling. My point is mathematically, they're two totally different things. – Brad Solomon Oct 24 '17 at 16:47
  • @OuuGiii yes, according to [this answer](https://stats.stackexchange.com/a/10298/112762) at least, **normalization** refers to a [0,1] range, and **standardization** refers to mean 0 and variance 1. – Brian Burns May 30 '18 at 13:23
  • Now that you see that "normalize" is context-dependent, ask the person who told you what you are supposed to do what they meant. Don't ask other people to guess. – philipxy Dec 03 '18 at 12:10

3 Answers


There are different types of normalization. You are using min-max normalization. The min-max normalization from scikit-learn is as follows.

import numpy as np
from sklearn.preprocessing import minmax_scale

# your function
def normalize_list(list_normal):
    max_value = max(list_normal)
    min_value = min(list_normal)
    for i in range(len(list_normal)):
        list_normal[i] = (list_normal[i] - min_value) / (max_value - min_value)
    return list_normal

# scikit-learn version
def normalize_list_numpy(list_numpy):
    normalized_list = minmax_scale(list_numpy)
    return normalized_list

test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test_array_numpy = np.array(test_array)

print(normalize_list(test_array))
print(normalize_list_numpy(test_array_numpy))

Output:

[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]    
[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]

minmax_scale (and the MinMaxScaler class) uses exactly your formula for normalization/scaling: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html
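As a quick sanity check (not part of the original answer), you can verify that minmax_scale reproduces the manual (x - min) / (max - min) formula on the test array:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# manual min-max formula from the question
manual = (x - x.min()) / (x.max() - x.min())

# the two results agree elementwise
print(np.allclose(minmax_scale(x), manual))  # True
```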

@OuuGiii: NOTE: It is not a good idea to use Python built-in function names as variable names. list() is a Python built-in, so its use as a variable name should be avoided.

utengr
  • Didn't know this existed, +1. @OuuGii directly from the docs for this function, "This transformation is often used as an alternative to zero mean, unit variance scaling." – Brad Solomon Oct 24 '17 at 16:50
  • @BradSolomon It's used quite often in sklearn for feature scaling before features are fed to various sensitive classifiers such as SVM or kNN. – utengr Oct 24 '17 at 16:57

The question/answer that you reference doesn't explicitly relate your own formula to the np.linalg.norm(list) version that you use here.

One NumPy solution would be this:

import numpy as np
def normalize(x):
    x = np.asarray(x)
    return (x - x.min()) / np.ptp(x)

print(normalize(test_array))    
# [ 0.     0.125  0.25   0.375  0.5    0.625  0.75   0.875  1.   ]

Here np.ptp is "peak to peak", i.e.

Range of values (maximum - minimum) along an axis.
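For the test array above, for instance:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
# peak-to-peak range: max(9) - min(1)
print(np.ptp(x))  # 8
```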

This approach scales the values to the interval [0, 1] as pointed out by @phg.

The more traditional definition of normalization would be to scale to zero mean and unit variance:

x = np.asarray(test_array)
res = (x - x.mean()) / x.std()
print(res.mean(), res.std())
# 0.0 1.0

Or use sklearn.preprocessing.scale as a pre-canned function. (Note that sklearn.preprocessing.normalize does something different: it scales samples to unit norm, the linear-algebra sense discussed next.)

Using test_array / np.linalg.norm(test_array) creates a result that is of unit length; you'll see that np.linalg.norm(test_array / np.linalg.norm(test_array)) equals 1. So you're talking about two different fields here, one being statistics and the other being linear algebra.
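A quick check of that claim:

```python
import numpy as np

test_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
# divide by the L2 norm to get a unit-length vector
unit = test_array / np.linalg.norm(test_array)
# the norm of the result is 1 (up to floating-point rounding)
print(np.linalg.norm(unit))
```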

Brad Solomon

The power of NumPy is its broadcasting property, which allows you to perform vectorized array operations without explicit looping. So you do not need to write a function with an explicit for loop, which is slow, especially if your dataset is big.

The idiomatic NumPy way of doing min-max normalization is

import numpy as np

test_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
normalized_test_array = (test_array - test_array.min()) / (test_array.max() - test_array.min())

output >> [ 0., 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1. ]
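The same broadcasting idea extends to 2-D data. This sketch (not from the original answer, using a small made-up array) min-max scales each column independently, which is the usual per-feature scaling:

```python
import numpy as np

# hypothetical 2-D data: rows are samples, columns are features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# broadcasting subtracts each column's min and divides by each column's range
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_norm)
# each column is now scaled to [0, 1]:
# column 1 -> 0.0, 0.5, 1.0 and column 2 -> 0.0, 0.5, 1.0
```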

ewalel