20

I have a huge data set from which I derive two sets of datapoints, which I then have to plot and compare. These two plots differ in their ranges, so I want to normalize both to the range [0, 1]. With the following code, one specific data set produces a constant line at 1, although the same normalization works well for other sets:

plt.plot(range(len(rvalue)),np.array(rvalue)/(max(rvalue)))

and for this code:

oldrange = max(rvalue) - min(rvalue)  # NORMALIZING
newmin = 0
newrange = 1 + 0.9999999999 - newmin
normal = map(
    lambda x, r=float(rvalue[-1] - rvalue[0]): ((x - rvalue[0]) / r)*1 - 0, 
    rvalue)
plt.plot(range(len(rvalue)), normal)

I get the error:

ZeroDivisionError: float division by zero

for all the data sets. I cannot figure out how to get both plots into one range for comparison.

– pypro
  • For those interested in normalizing data in Django, have a look a this solution: https://stackoverflow.com/a/68258914 – djvg Jul 05 '21 at 17:45

9 Answers

48

Use the following method to normalize your data to the range [0, 1] using the minimum and maximum values of the data sequence:

import numpy as np

def NormalizeData(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data))
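For example, applied to a small illustrative array (the values here are made up):

```python
import numpy as np

def NormalizeData(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data))

# Illustrative data; any numeric array works
arr = np.array([10.0, 20.0, 30.0, 40.0])
print(NormalizeData(arr))  # smallest value maps to 0, largest to 1
```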
– user3284005
21

Use scikit-learn: http://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range

It has built-in functions to scale features to a specified range. You'll find other functions to normalize and standardize there as well.

See this example:

>>> import numpy as np
>>> from sklearn import preprocessing
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
– Marissa Novak
10

scikit-learn has a function for this:

sklearn.preprocessing.minmax_scale(X, feature_range=(0, 1), axis=0, copy=True)

It is more convenient than using the class MinMaxScaler.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale
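A minimal sketch of how it might be applied to a 1-D array (the data values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

rvalue = np.array([3.0, 7.0, 5.0, 11.0])  # illustrative data
normal = minmax_scale(rvalue)  # feature_range defaults to (0, 1)
print(normal)  # minimum maps to 0.0, maximum to 1.0
```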

– R Zhang
    what is sklearn? Answers are generally less accepted if users have to ask for clarifications. I suggest adding more of an explanation to your answer so it isn't flagged as low quality. – Danoram Oct 11 '19 at 00:16
7

NumPy provides a built-in function, numpy.ptp() (peak-to-peak), for finding the range of an array, so your question can be addressed by:

import numpy as np

# First, filter input_array so that it does not contain NaN or Inf.
input_array = np.array(some_data)
if np.unique(input_array).shape[0] == 1:
    pass  # handle the degenerate case where input_array is constant
else:
    result_array = (input_array - np.min(input_array)) / np.ptp(input_array)
# To extend this to higher dimensions, pass the axis= kwarg to np.min and np.ptp
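For example, with some made-up data:

```python
import numpy as np

input_array = np.array([4.0, 8.0, 6.0, 12.0])  # illustrative values
# np.ptp gives max - min, i.e. the range of the array
result_array = (input_array - np.min(input_array)) / np.ptp(input_array)
print(result_array)  # minimum maps to 0, maximum to 1
```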
– CT Zhu
2

I tried to simplify things a little. Try this:

oldmin = min(rvalue)
oldmax = max(rvalue)
oldrange = oldmax - oldmin
newmin = 0.
newmax = 1.
newrange = newmax - newmin
if oldrange == 0:            # Deal with the case where rvalue is constant:
    if oldmin < newmin:      # If rvalue < newmin, set all rvalue values to newmin
        newval = newmin
    elif oldmin > newmax:    # If rvalue > newmax, set all rvalue values to newmax
        newval = newmax
    else:                    # If newmin <= rvalue <= newmax, keep rvalue the same
        newval = oldmin
    normal = [newval for v in rvalue]
else:
    scale = newrange / oldrange
    normal = [(v - oldmin) * scale + newmin for v in rvalue]

plt.plot(range(len(rvalue)), normal)

The only reason I can see for the ZeroDivisionError is if the data in rvalue were constant (all values are the same). Is that the case?

– Brionius
  • Yeah, I see that for some cases rvalue is constant and so oldrange = 0. I also figured out that for most of the data sets my rvalue plot stays in the range [0, 1], so I guess there won't be a need to normalize this plot, only the other one. But I was wondering: in order to make my code work for all kinds of data sets (in which rvalue isn't in the range [0, 1]), is there any way to normalize without getting an error?... – pypro Aug 22 '13 at 13:33
  • @user2690054: Sure, you just have to decide what the behavior should be. For example, if `rvalue = [-20, -20, ... , -20]`, should that be mapped to `[0.0, 0.0, ..., 0.0]`? And should `rvalue = [30, 30, ..., 30]` be mapped to `[1.0, 1.0, ..., 1.0]`? – Brionius Aug 22 '13 at 13:43
  • @user2690054 I added some statements to deal with oldrange being zero - see if it does what you want. – Brionius Aug 22 '13 at 13:48
  • I think the modifications should suit my requirement, depending on what behavior I want, as you mentioned. Thanks a lot for that... the only glitch is that the line `scale = newrange / oldrange` should be in the else part, because it raises the ZeroDivisionError at that point and never enters the if clause. Thanks for helping! – pypro Aug 22 '13 at 16:09
1

Just to provide some background for the other answers, here's a derivation:

A straight line through points (x1, y1) and (x2, y2) can be expressed as:

y = y1 + slope * (x - x1)

where

slope = (y2 - y1) / (x2 - x1)

now, normalization from 0 to 1 implies

y1 = 0, y2 = 1

and

x1 = x_min, x2 = x_max

(or vice versa, depending on your needs)

the equation then reduces to

y = (x - x_min) / (x_max - x_min)
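In code, the final formula might look like this (with hypothetical data):

```python
x = [2.0, 4.0, 6.0, 10.0]  # hypothetical data
x_min, x_max = min(x), max(x)
# y = (x - x_min) / (x_max - x_min), applied elementwise
y = [(v - x_min) / (x_max - x_min) for v in x]
print(y)  # [0.0, 0.25, 0.5, 1.0]
```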
– djvg
1

I prefer the preprocessing tools from scikit-learn, similar to Marissa Novak's and R Zhang's answers, though I like a different structure:

import numpy as np
from sklearn import preprocessing

# data
years = [1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1984, 1986, 1989,
         1993, 1994, 1997, 1998, 1999, 2002, 2004, 2010, 2017, 2018, 2021, 2022]

# specify the range to which you want to scale
rng = (0, 1) 

# initiate the scaler
# 0,1 is the default feature_range and doesn't have to be specified
scaler = preprocessing.MinMaxScaler(feature_range=(rng[0], rng[1]))

# apply the scaler
normed = scaler.fit_transform(np.array(years).reshape(-1, 1))

# the output is an array of arrays, so tidy the dimensions
norm_lst = [round(i[0],2) for i in normed]

While this is more verbose than R Zhang's answer and less preferable for the original use case with a "huge" data set, I prefer it for readability in most of my applications (<10^3 values).

rng = (0,1) yields:

[0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.24, 0.28, 0.34, 0.42, 0.44, 0.5, 0.52, 0.54, 0.6, 0.64, 0.76, 0.9, 0.92, 0.98, 1.0]

rng = (0.3,0.8), for example, yields:

[0.3, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.42, 0.44, 0.47, 0.51, 0.52, 0.55, 0.56, 0.57, 0.6, 0.62, 0.68, 0.75, 0.76, 0.79, 0.8]
– CreekGeek
0

You can divide each number in your sample by the sum of all the numbers in the sample. Note that the results then sum to 1 rather than spanning the full [0, 1] range, and they only stay within [0, 1] if all values are nonnegative.
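A sketch with illustrative, nonnegative data (with negative values the results can fall outside [0, 1]):

```python
data = [1.0, 2.0, 3.0, 4.0]  # illustrative nonnegative values
total = sum(data)
normal = [v / total for v in data]
print(normal)  # [0.1, 0.2, 0.3, 0.4]
```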

-3

A simple way to map values into the range [0, 1] is to divide every value by the maximum value. Note that this only keeps values within [0, 1] when all of them are nonnegative, and the minimum only maps to 0 if it is already 0.
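A sketch, assuming all values are nonnegative (the data here is made up):

```python
data = [2.0, 5.0, 10.0]  # illustrative nonnegative values
m = max(data)
normal = [v / m for v in data]
print(normal)  # [0.2, 0.5, 1.0]
```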

– Jay Dangar