62

I need to normalize a list of values so that they form a probability distribution, i.e. each value lies between 0.0 and 1.0 and the values sum to 1.0.

I understand how to normalize, but was curious whether Python has a function to automate this.

I'd like to go from:

raw = [0.07, 0.14, 0.07]  

to

normed = [0.25, 0.50, 0.25]
APerson
Adam_G
  • 10
    Why wouldn't that be `[0.5, 1.0, 0.5]`? – Joran Beasley Nov 06 '14 at 17:15
  • 6
    @Joran Because OP wants `sum(normed) == 1.0` (ignoring floating point errors). – Kevin Nov 06 '14 at 17:16
  • See this post if you would like to normalize between a different range. [How to normalize a list of positive and negative decimal number to a specific range](http://stackoverflow.com/questions/16514443/how-to-normalize-a-list-of-positive-and-negative-decimal-number-to-a-specific-ra) – salomonvh Sep 22 '15 at 07:57

11 Answers

110

Use:

norm = [float(i)/sum(raw) for i in raw]

to normalize against the sum, so that the result sums to 1.0 (or as close to it as floating-point arithmetic allows).

Use:

norm = [float(i)/max(raw) for i in raw]

to normalize against the maximum, so that the largest value becomes 1.0.
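
As a rough illustration of how the two variants differ, the sketch below applies both to the same list. The helper names `normalize_by_sum` and `normalize_by_max` are made up for this example, and the sum or maximum is computed once up front rather than once per element.

```python
def normalize_by_sum(values):
    """Scale values so that they add up to 1.0 (assumes a non-zero sum)."""
    total = sum(values)  # computed once, not once per element
    return [v / total for v in values]


def normalize_by_max(values):
    """Scale values so that the largest one becomes 1.0 (assumes a non-zero max)."""
    largest = max(values)
    return [v / largest for v in values]


raw = [1, 2, 1]
by_sum = normalize_by_sum(raw)  # [0.25, 0.5, 0.25]
by_max = normalize_by_max(raw)  # [0.5, 1.0, 0.5]
print(by_sum, by_max)
```

Only the first variant produces a list that sums to 1.0, which is what a probability distribution needs.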

shivank01
Tony Suffolk 66
  • 33
    Nice. It's maybe worth noting that computing the sum in advance, rather than for each element in the comprehension, would be more efficient. So: `s = sum(raw); norm = [float(i)/s for i in raw]` – mattsilver May 05 '15 at 23:43
  • Is that the same as `(np.array(x) / np.array(x).sum()) / np.array(x).max()` ? – alvas Feb 21 '18 at 02:40
  • 1
    @alvas sorry - I can't be sure about numpy - but assuming dividing an array by a single value divides each value in the array; then it looks right. – Tony Suffolk 66 Feb 21 '18 at 14:18
17

If your list has negative numbers, this is how you would rescale it to the range 0.0 to 1.0 (min-max normalization):

a = range(-30,31,5)
norm = [(float(i)-min(a))/(max(a)-min(a)) for i in a]
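
A minimal sketch of the same min-max rescaling with the minimum and maximum computed once. `min_max_scale` is a hypothetical helper name, and returning all zeros for a constant list is an assumption made here, not something the answer specifies. Note that the result spans 0.0 to 1.0 but does not sum to 1.0.

```python
def min_max_scale(values):
    values = list(values)
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant list maps to all zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


scaled = min_max_scale(range(-30, 31, 5))
print(scaled[0], scaled[6], scaled[-1])  # 0.0 0.5 1.0
```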
blaylockbk
10

If you want to use scikit-learn, you can use `normalize`:

from sklearn.preprocessing import normalize

x = [1,2,3,4]
normalize([x]) # array([[0.18257419, 0.36514837, 0.54772256, 0.73029674]])
normalize([x], norm="l1") # array([[0.1, 0.2, 0.3, 0.4]])
normalize([x], norm="max") # array([[0.25, 0.5 , 0.75, 1.]])
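
To make the three `norm` options less opaque, here is a rough pure-Python sketch of what each one divides by, for a single row of non-negative numbers. The function `normalize_row` is made up for this illustration; it is not scikit-learn's implementation, and it assumes the row is not all zeros.

```python
import math


def normalize_row(row, norm="l2"):
    if norm == "l1":
        scale = sum(abs(v) for v in row)            # sum of absolute values
    elif norm == "l2":
        scale = math.sqrt(sum(v * v for v in row))  # Euclidean length
    elif norm == "max":
        scale = max(abs(v) for v in row)            # largest absolute value
    else:
        raise ValueError("unknown norm: %r" % norm)
    return [v / scale for v in row]


x = [1, 2, 3, 4]
l1 = normalize_row(x, "l1")
l2 = normalize_row(x, "l2")
mx = normalize_row(x, "max")
print(l1, l2, mx)
```

Only the `l1` variant yields values that sum to 1, which is what the question asks for.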
Anh-Thi DINH
  • Or for a completely different kind of normalization: `from sklearn.utils.extmath import softmax` or `from scipy.special import softmax` – Stef Dec 08 '21 at 13:55
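
For reference, softmax maps arbitrary real numbers (including negatives) to positive values that sum to 1 by exponentiating before dividing. A minimal pure-Python sketch, with `softmax_list` as a made-up name and the subtract-the-maximum step included as a common numerical-stability trick:

```python
import math


def softmax_list(values):
    shift = max(values)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_list([-1.0, 0.0, 1.0])
print(probs)
```

Unlike dividing by the sum, this does not preserve the ratios between the inputs.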
7

How long is the list you're going to normalize?

def psum(it):
    "This function makes explicit how many calls to sum() are done."
    print("Another call!")
    return sum(it)

raw = [0.07, 0.14, 0.07]
print("How many calls to sum()?")
print([r/psum(raw) for r in raw])

print("\nAnd now?")
s = psum(raw)
print([r/s for r in raw])

# if one doesn't want auxiliary variables, it can be done inside
# a list comprehension, but in my opinion it's quite Baroque
print("\nAnd now?")
print([r/s for s in [psum(raw)] for r in raw])

Output

# How many calls to sum()?
# Another call!
# Another call!
# Another call!
# [0.25, 0.5, 0.25]
# 
# And now?
# Another call!
# [0.25, 0.5, 0.25]
# 
# And now?
# Another call!
# [0.25, 0.5, 0.25]
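
One way to put numbers on this is `timeit`. The sketch below times both variants on an arbitrary list of 1,000 values chosen for illustration; the exact figures depend on the machine, so none are shown.

```python
import timeit

raw = [float(i) for i in range(1, 1001)]


def repeated_sum():
    # sum(raw) is evaluated once per element
    return [r / sum(raw) for r in raw]


def hoisted_sum():
    # sum(raw) is evaluated once in total
    s = sum(raw)
    return [r / s for r in raw]


t_repeated = timeit.timeit(repeated_sum, number=10)
t_hoisted = timeit.timeit(hoisted_sum, number=10)
print("sum() per element: %.4f s" % t_repeated)
print("sum() once:        %.4f s" % t_hoisted)
```

The gap grows with the length of the list, because the first variant does work proportional to the square of the list length.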
gboffi
6

Try:

>>> normed = [i/sum(raw) for i in raw]
>>> normed
[0.25, 0.5, 0.25]
Anzel
4

There isn't any function in the standard library (to my knowledge) that will do it, but there are certainly third-party modules that have such functions. However, it's easy enough that you can just write your own function:

def normalize(lst):
    s = sum(lst)
    # A list comprehension returns a list on both Python 2 and 3;
    # map() would return a lazy iterator on Python 3.
    return [float(x)/s for x in lst]

Sample output:

>>> normed = normalize(raw)
>>> normed
[0.25, 0.5, 0.25]
wnnmaw
  • This is one of the two answers that extract `sum()` from the loop... I still prefer mine but I think this is a `+` exactly for the auxiliary variable `s = sum(lst)`. – gboffi Nov 06 '14 at 17:37
  • 4
    `normalize([1,0,-1])` will raise `ZeroDivisionError` :) – Yan Foto Nov 14 '15 at 14:06
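
One way to handle the zero-sum case raised in the comment above is to check the total before dividing. This is a sketch only: the name `normalize_checked` and the choice to raise `ValueError` are assumptions, and other policies (such as returning a uniform distribution) may suit other use cases.

```python
def normalize_checked(lst):
    s = sum(lst)
    if s == 0:
        # Assumption: a zero total is treated as invalid input.
        raise ValueError("cannot normalize: values sum to zero")
    return [float(x) / s for x in lst]


print(normalize_checked([1, 2, 1]))  # [0.25, 0.5, 0.25]

try:
    normalize_checked([1, 0, -1])
except ValueError as exc:
    print("rejected:", exc)
```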
4

If you are willing to use NumPy, you can get a faster solution.

import random, time
import numpy as np

a = random.sample(range(1, 20000), 10000)

# Pure Python; note that sum(a) is recomputed for every element
since = time.time()
b = [i/sum(a) for i in a]
print(time.time()-since)
# 0.7956490516662598

# NumPy: one vectorized division
since = time.time()
c = np.array(a)
d = c/sum(a)
print(time.time()-since)
# 0.001413106918334961
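
A slightly more idiomatic NumPy variant keeps the summation inside NumPy as well. This is a sketch, with `normalize_array` as a made-up helper. Note that if the input mixes positive and negative values, some outputs will be negative too, since dividing by the sum does not shift the data.

```python
import numpy as np


def normalize_array(values):
    arr = np.asarray(values, dtype=float)
    return arr / arr.sum()  # element-wise division by a scalar


normed = normalize_array([0.07, 0.14, 0.07])
print(normed)
```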
Tengerye
  • Are you sure this equation is right? I am getting values in d < 0. Not sure if this should happen. Maybe I did something wrong. I am inputting values from ~ -0.5 to 0.5. – ScipioAfricanus Sep 02 '19 at 21:43
  • @ScipioAfricanus As used here, `random.sample` only yields integers. If floats are required, check `np.random.uniform` or something similar instead. – Tengerye Sep 03 '19 at 01:59
3

Try this:

from __future__ import division

raw = [0.07, 0.14, 0.07]  

def norm(input_list):
    norm_list = list()

    if isinstance(input_list, list):
        sum_list = sum(input_list)

        for value in input_list:
            tmp = value / sum_list
            norm_list.append(tmp)

    return norm_list

print(norm(raw))

This will do what you asked, but I would suggest trying min-max normalization.

Min-max normalization:

def min_max_norm(dataset):
    # Initialize outside the `if` so the return also works for non-list input
    norm_list = list()

    if isinstance(dataset, list):
        min_value = min(dataset)
        max_value = max(dataset)

        for value in dataset:
            tmp = (value - min_value) / (max_value - min_value)
            norm_list.append(tmp)

    return norm_list
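
The same formula extends to any target range by scaling and shifting the 0-to-1 result. A sketch under that assumption, where `rescale`, `new_min`, and `new_max` are names invented here and the input is assumed to contain at least two distinct values:

```python
def rescale(dataset, new_min=0.0, new_max=1.0):
    lo, hi = min(dataset), max(dataset)
    span = hi - lo  # assumed non-zero
    return [new_min + (v - lo) / span * (new_max - new_min) for v in dataset]


print(rescale([2, 4, 6], -1.0, 1.0))  # [-1.0, 0.0, 1.0]
```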
Nurul Akter Towhid
2

When working with data, pandas is often the simplest option.

This code puts `raw` into a single column and divides each value by that column's sum. (The data could instead go into a single row and be normalized across that row by swapping the axis values, where 0 refers to rows and 1 to columns.)

import pandas as pd


raw = [0.07, 0.14, 0.07]  

raw_df = pd.DataFrame(raw)
normed_df = raw_df.div(raw_df.sum(axis=0), axis=1)
normed_df

where normed_df will display like:

    0
0   0.25
1   0.50
2   0.25

You can then keep working with the data in the DataFrame.

1

Here is a not-terribly-inefficient one-liner similar to the top answer (it only performs the summation once):

norm = (lambda the_sum:[float(i)/the_sum for i in raw])(sum(raw))

A similar approach works for a list with negative numbers:

norm = (lambda the_max, the_min: [(float(i)-the_min)/(the_max-the_min) for i in raw])(max(raw),min(raw))
Jeff Hykin
0

Use scikit-learn:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler expects a 2-D array: one column, one row per value
data = np.array([1,2,3]).reshape(-1, 1)
scaler = MinMaxScaler()
scaler.fit(data)
print(scaler.transform(data))
keramat