460

I have a dataframe in pandas where each column has a different value range. For example:

df:

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

Any idea how I can normalize the columns of this dataframe so that each value is between 0 and 1?

My desired output is:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18 (which is 0.09/0.5)
Trenton McKinney
ahajib
  • there is an apply function, e.g. frame.apply(f, axis=1) where f is a function that does something with a row... – tschm Oct 16 '14 at 22:30
  • Normalization might not be the most appropriate wording, since [scikit-learn documentation](https://scikit-learn.org/stable/modules/preprocessing.html#normalization) defines it as "the process of **scaling individual samples to have unit norm**" (i.e. row by row, if I get it correctly). – Skippy le Grand Gourou Mar 05 '19 at 16:58
  • I do not get it, why min_max scaling is considered normalization! normal has got to have meaning in the sense of normal distribution with mean zero and variance 1. – OverFlow Police Apr 21 '19 at 02:21
  • If you are visiting this question in 2020 or later, look at the answer by @Poudel; you get a different answer for normalizing if you use pandas vs sklearn. – BhishanPoudel Jan 29 '20 at 20:10
  • @Poudel is this due to the `ddof` argument? – fffrost Apr 04 '20 at 20:26
  • @fffrost Yes thats correct. – BhishanPoudel Apr 04 '20 at 21:47
  • Normalisation in this context usually means scaling values so that their sum is 1. This is easy to do in Pandas: `df['normalised'] = df.value / df.value.sum()`. – Tom Apr 06 '22 at 13:32
  • [How to normalize all columns in a dataframe in pandas](https://www.educative.io/edpresso/how-to-normalize-all-columns-in-a-dataframe-in-pandas) – JeeyCi Jun 12 '22 at 06:16

24 Answers

788

One easy way is to use Pandas (here I use mean normalization):

normalized_df = (df - df.mean()) / df.std()

To use min-max normalization:

normalized_df = (df - df.min()) / (df.max() - df.min())

Edit: To address some concerns, I should say that Pandas automatically applies these operations column-wise in the code above.
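
For instance, running the min-max version on the data from the question scales each column independently (a quick check of mine, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'A': [1000, 765, 800], 'B': [10, 5, 7], 'C': [0.5, 0.35, 0.09]})
normalized_df = (df - df.min()) / (df.max() - df.min())
print(normalized_df)
#           A    B         C
# 0  1.000000  1.0  1.000000
# 1  0.000000  0.0  0.634146
# 2  0.148936  0.4  0.000000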

Cina
  • Can it be somehow done with a window function? What I mean by that is calculating max() and min() based on e.g. the latest 10 observations. – krakowi Nov 15 '19 at 10:12
  • if you want to save some column - do `normalized_df['TARGET'] = df['TARGET']` – Roman Filippov Dec 22 '19 at 09:53
  • Comparing this with `MinMaxScaler()`, which one would be faster in a case where features will be greater than 1000? And, uses less memory? – SajidSalim Apr 24 '20 at 17:04
  • this is a good solution, but you need a lot of less-beautiful checks to avoid divide by zero errors – Teddy Ward May 07 '20 at 21:26
  • This takes a very long time on my dataframe (200k rows, 14 columns). Does anyone know why this is the case? – Psychotechnopath May 11 '20 at 09:14
  • Made a new question; if anyone knows the answer please enlighten me: https://stackoverflow.com/questions/61726904/weird-increase-in-time-when-using-dataframe-mean – Psychotechnopath May 12 '20 at 11:02
  • is there a built-in standard way of doing this per column without looping over all the columns? – Gulzar Jan 12 '21 at 10:42
  • Note that this will produce a NaN if a column has all the same value – G M Mar 23 '22 at 14:00
  • Mean and standard deviation are used for standardization. Minimum and maximum values of features are used for normalization. – Starbucks Jun 29 '22 at 18:27
  • @krakowi you could try `normalized_df = (df - df.rolling(10).min())/(df.rolling(10).max()-df.rolling(10).min())` then drop NA of course – nickw Jul 03 '22 at 16:06
  • Hi, I want to check: does `normalized_df=(df-df.min())/(df.max()-df.min())` normalize by each column? Thanks – Megan Sep 05 '22 at 05:29
412

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
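
If you want to keep the original column names and index (as several of the comments below point out), you can wrap the scaled array back into a DataFrame yourself; a minimal sketch using the question's data:

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [1000, 765, 800], 'B': [10, 5, 7], 'C': [0.5, 0.35, 0.09]})

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
# Rebuild the DataFrame so the labels survive the round trip through numpy
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)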

Amir Imani
Sandman
  • i think this will get rid of the column names, which might be one of the reasons op is using dataframes in the first place. – pietz Jan 16 '17 at 21:02
  • This will normalize the rows and not the columns, unless you transpose it first. To do what the Q asks for: `pd.DataFrame(min_max_scaler.fit_transform(df.T), columns=df.columns, index=df.index)` – hobs Jan 20 '17 at 23:47
  • @pietz to keep column names, see [this post](https://stackoverflow.com/a/29907472/588437). Basically replace the last line with `df=pandas.DataFrame(x_scaled, columns=df.columns)` – ijoseph Jun 26 '17 at 18:52
  • @hobs This is not correct. Sandman's code normalizes column-wise and per-column. You get the wrong result if you transpose. – petezurich Apr 01 '18 at 14:10
  • @petezurich It looks like Sandman or Praveen corrected their code. Unfortunately, it's not possible to correct comments ;) – hobs Apr 03 '18 at 21:25
  • It's much easier to do this with list comprehension in one line. There's no need to import any new libraries. See my answer below. – Chad Aug 01 '19 at 22:02
  • Thanks. But instead of storing df.values in x and then passing it to fit_transform, what if we write `.fit_transform(df.values)`? Can we save the memory used to store x, or does Python automatically optimize it for us? – Asis Dec 27 '19 at 19:47
  • You can't specify a 0 minimum with MinMaxScaler, which is a problem for me. – rjurney Oct 21 '20 at 23:23
83

Detailed Example of Normalization Methods

  • Pandas normalization (unbiased)
  • Sklearn normalization (biased)
  • Does biased-vs-unbiased affect Machine Learning?
  • Min-max scaling

References: Wikipedia: Unbiased Estimation of Standard Deviation

Example Data

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c

Normalization using pandas (Gives unbiased estimates)

When normalizing, we simply subtract the mean and divide by the standard deviation.

df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
     A    B  C
0 -1.0 -1.0  a
1  0.0  0.0  b
2  1.0  1.0  c

Normalization using sklearn (Gives biased estimates, different from pandas)

If you do the same thing with sklearn you will get DIFFERENT output!

import pandas as pd

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
          A         B  C
0 -1.224745 -1.224745  a
1  0.000000  0.000000  b
2  1.224745  1.224745  c

Does the biased estimator in sklearn make Machine Learning less powerful?

NO.

The official documentation of sklearn.preprocessing.scale states that using a biased estimator is UNLIKELY to affect the performance of machine learning algorithms, so we can safely use it.

From official documentation:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
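
A quick way to see the ddof difference on the example column A (this check is mine, not part of the quoted documentation):

import numpy as np

a = np.array([1, 2, 3])
print(a.std(ddof=1))                   # 1.0       -> what pandas .std() uses (unbiased)
print(a.std(ddof=0))                   # 0.816...  -> what sklearn's StandardScaler uses (biased)
print((a - a.mean()) / a.std(ddof=0))  # [-1.2247  0.  1.2247], matching the sklearn output above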

What about MinMax Scaling?

There is no standard deviation calculation in MinMax scaling, so the result is the same in both pandas and scikit-learn.

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
             })
(df - df.min()) / (df.max() - df.min())
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0


# Using sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
arr_scaled = scaler.fit_transform(df) 

print(arr_scaled)
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0
BhishanPoudel
77

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

You don't need to worry about whether your values are negative or positive, and the values will be nicely spread out between 0 and 1.
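
For example, a column with negative entries still ends up in [0, 1] (an illustrative check on made-up data, assuming the normalize function above):

import pandas as pd

df = pd.DataFrame({'A': [-5, 0, 10], 'B': [0.5, 0.35, 0.09]})
print(normalize(df))
#           A         B
# 0  0.000000  1.000000
# 1  0.333333  0.634146
# 2  1.000000  0.000000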

Michael Aquilina
  • Be careful when min and max values are the same; your denominator is 0 and you will get a NaN value. – hru_d Feb 01 '19 at 06:02
  • @HrushikeshDhumal, no need to normalize then, since all values would be equal. – Appaji Chintimi Oct 26 '20 at 09:13
  • @AppajiChintimi, this solution applies to the entire data; if you haven't done a sanity check you could run into trouble. – hru_d Oct 29 '20 at 23:54
  • If you have numeric and non-numeric columns mixed, use `for feature_name in df.select_dtypes(include=['int', 'float']).columns:` to only normalize numeric columns – Caridorc Dec 26 '22 at 17:49
63

Your problem is actually a simple transform acting on the columns:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

Or, even more tersely:

   frame.apply(lambda x: x/x.max(), axis=0)
tschm
  • The `lambda` one is the best :-) – Abu Shoeb Dec 08 '18 at 23:49
  • isn't this supposed to be axis=1 since the question is about column-wise normalization? – Nguai al Apr 26 '19 at 23:27
  • No, from the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html): `axis [...] 0 or 'index': apply function to each column`. The default is actually `axis=0`, so this one-liner can be written even shorter :-) Thanks @tschm. – gosuto Apr 11 '20 at 15:01
  • This is only correct if the min is 0, which isn't something that you should really assume – QFSW Nov 21 '20 at 17:19
  • My example was meant to illustrate how to apply functions on columns of dataframes. Obviously, as always, you need to pay attention to corner cases, e.g. here the max could be zero and result in an issue. Not sure I understand @QFSW. – tschm Nov 22 '20 at 18:28
43

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df) 
df.loc[:,:] = scaled_values
j sad
38

Take care with this answer, as it ONLY works for data that ranges [0, n]. This does not work for any range of data.


Simple is Beautiful:

df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()
Gulzar
Basil Musa
  • Note that OP asked for the [0..1] range and this solution scales to the [-1..1] range. Try this with the array [-10, 10]. – Alexander Sosnovshchenko Apr 28 '18 at 09:20
  • @AlexanderSosnovshchenko not really. Basil Musa is assuming the OP's matrix is always non-negative, that's why he has given this solution. If some column has a negative entry then this code does NOT normalize to the [-1,1] range. Try it with the array [-5, 10]. The correct way to normalize to [0,1] with negative values was given by Cina's answer: `df["A"] = (df["A"]-df["A"].min()) / (df["A"].max()-df["A"].min())` – Pepe Mandioca Nov 09 '18 at 13:24
  • Perhaps even simpler: `df /= df.max()` - assuming the goal is to normalise each and every column, individually. – n1k31t4 May 31 '20 at 22:26
  • **This answer is wrong.** The non-negative assumption can't be made here, as neither the OP nor future readers stated it. Moreover, even strictly positive data doesn't work here: `[1, 10]` will be normalized to `[0.1, 1]` instead of `[0, 1]`. – Gulzar May 12 '21 at 11:53
  • Thanks @Gulzar, I'm the author of this answer and TBH I was surprised that it was upvoted 29 times. – Basil Musa May 18 '21 at 15:12
35

You can create a list of the columns that you want to normalize:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index=df.index)
df[column_names_to_normalize] = df_temp

Your Pandas DataFrame is now normalized only at the columns you want.


However, if you want the opposite, that is, to select a list of columns that you DON'T want to normalize, you can simply create a list of all columns and remove the undesired ones:

column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]
raullalves
16

I think that a better way to do that in pandas is just

import numpy as np

df = df/df.max().astype(np.float64)

Edit: If negative numbers are present in your data frame, you should instead divide by the maximum absolute value of each column:

df = df/df.abs().max().astype(np.float64)
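
For example, with a made-up frame that contains negative entries (using the max-abs variant above), every column ends up in [-1, 1]:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-5, 2, 10], 'B': [3, -8, 4]})
df = df/df.abs().max().astype(np.float64)
print(df)
#      A      B
# 0 -0.5  0.375
# 1  0.2 -1.000
# 2  1.0  0.500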
Daniele
14

The solution given by Sandman and Praveen is very good. The only problem with it is that if you have categorical variables in other columns of your data frame, this method will need some adjustments.

My solution to this type of issue is the following:

from sklearn import preprocessing

x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
x_new = pd.DataFrame(x_scaled, columns=x.columns, index=df.index)
df = pd.concat([df.Categoricals, x_new], axis=1)
Ozkan Serttas
  • This answer is useful because most examples on the internet apply one scaler to all the columns, whereas this actually addresses the situation where one scaler, say the MinMaxScaler, should not apply to all columns. – demongolem Sep 10 '18 at 17:07
13

Normalize

You can use minmax_scale to transform each column to a scale from 0-1.

from sklearn.preprocessing import minmax_scale
df[:] = minmax_scale(df)

Standardize

You can use scale to center each column to the mean and scale to unit variance.

from sklearn.preprocessing import scale
df[:] = scale(df)

Column Subsets

Normalize single column

from sklearn.preprocessing import minmax_scale
df['a'] = minmax_scale(df['a'])

Normalize only numerical columns

import numpy as np
from sklearn.preprocessing import minmax_scale
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])

Full Example

# Prep
import pandas as pd
import numpy as np
from sklearn.preprocessing import minmax_scale

# Sample data
df = pd.DataFrame({'a':[0,1,2], 'b':[-10,-30,-50], 'c':['x', 'y', 'z']})

# MinMax normalize all numeric columns
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])

# Result
print(df)

#      a    b  c
# 0  0.0  1.0  x
# 1  0.5  0.5  y
# 2  1.0  0.0  z

Notes:

  • In all examples, scale can be used instead of minmax_scale.
  • Index, column names and non-numerical variables are kept unchanged.
  • The function is applied to each column.

Caution:

For machine learning, use minmax_scale or scale after train_test_split to avoid data leakage.
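
For instance, a common leak-free pattern is to fit the scaler on the training split only and then transform both splits (a sketch of mine using MinMaxScaler and train_test_split, not taken from the answer above):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'a': range(10), 'b': range(0, 100, 10)})
train, test = train_test_split(df, test_size=0.3, random_state=0)

scaler = MinMaxScaler()
# Fit on the training split only, then reuse its min/max for the test split
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)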

Info

More info on standardization and normalization:

Benjamin Ziepert
12

You might want some of the columns to be normalized while the others stay unchanged, as in some regression tasks where the data labels or categorical columns are left as-is. So I suggest this Pythonic way (it's a combination of @shg's and @Cina's answers):

features_to_normalize = ['A', 'B', 'C']
# could be ['A','B'] 

df[features_to_normalize] = df[features_to_normalize].apply(lambda x:(x-x.min()) / (x.max()-x.min()))
Masoud Masoumi Moghadam
11

It is only simple mathematics. The answer should be as simple as below.

normed_df = (df - df.min()) / (df.max() - df.min())
Yuan
10
df_normalized = df / df.max(axis=0)
Davoud Taghawi-Nejad
4

This is how you do it column-wise using list comprehension:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]
Chad
4

You can simply use the pandas.DataFrame.transform function in this way:

df.transform(lambda x: x/x.max())
antonjs
  • This solution won't work if all values are negative. Consider [-1, -2, -3]. We divide by -1, and now we have [1,2,3]. – Dave Liu Dec 05 '19 at 19:42
  • To properly handle negative numbers: df.transform(lambda x: x / abs(x).max()) – nvd Mar 25 '21 at 21:26
3
import numpy as np

def normalize(x):
    try:
        x = x/np.linalg.norm(x, ord=1)
        return x
    except:
        raise

data = data.apply(normalize)

From the pandas documentation, the DataFrame structure can apply an operation (function) to itself.

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.

You can apply a custom function to operate on the DataFrame.

shg
  • It would be good to explain why your code solves the OP's problem, so people can adapt the strategy rather than just copy your code. Please read [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer) – Mr. T Apr 13 '18 at 09:53
3

The following function calculates the Z score:

def standardization(dataset):
  """ Standardization of numeric fields, where all values will have mean of zero 
  and standard deviation of one. (z-score)

  Args:
    dataset: A `Pandas.Dataframe` 
  """
  dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
  # Normalize numeric columns.
  for column, dtype in dtypes:
      if dtype == 'float32':
          dataset[column] -= dataset[column].mean()
          dataset[column] /= dataset[column].std()
  return dataset
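
A quick usage check (my own example; note that, as written, the function only touches float32 columns, so the sample data is cast explicitly):

import pandas as pd

df = pd.DataFrame({'A': [1000.0, 765.0, 800.0], 'B': [10.0, 5.0, 7.0]}).astype('float32')
df = standardization(df)
print(df.mean())  # approximately 0 for each column
print(df.std())   # approximately 1 for each column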
gogasca
2

New Scikit-Learn (Version>=1.2): Keeps DataFrame Column Names

In the new version of scikit-learn, it is now actually possible to keep the pandas column names intact even after the transform; below is an example:

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

>>> df = pd.DataFrame({'col1':[1000, 765, 800], 'col2':[10, 5, 7], 'col3':[0.5, 0.35, 0.09]}, )
>>> df.head(3)
   col1  col2  col3
0  1000    10  0.50
1   765     5  0.35
2   800     7  0.09

>>> scaler = MaxAbsScaler().set_output(transform="pandas") #change here
>>> scaler.fit(df)
>>> df_scaled = scaler.transform(df)
>>> df_scaled.head(3)

   col1  col2  col3
0  1.000   1.0  1.00
1  0.765   0.5  0.70
2  0.800   0.7  0.18

I wrote a summary of the new updates here and you can also check the scikit-learn release highlights page.

Also, I have personally never been a big fan of MaxAbsScaler, but I went with this one to answer the OP's question.

Hope this helps, cheers!!

Suvo
0

You can do this in one line:

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)

It takes the mean of each column and subtracts it from every row (the mean of a particular column is subtracted only within that column), then divides by that same mean. What we finally get is a mean-normalized data set.

LOrD_ARaGOrN
0

Pandas performs this normalization column-wise by default. Try the code below.

X= pd.read_csv('.\\data.csv')
X = (X-X.min())/(X.max()-X.min())

The output values will be in range of 0 and 1.

ahajib
faiz
0

Use the apply function with a lambda, which speeds up the process:

import numpy as np

def normalize(df_col):

    # Condition to exclude the 'ID' and 'Class' features
    if str(df_col.name) != 'ID' and str(df_col.name) != 'Class':
        max_value = df_col.max()
        min_value = df_col.min()

        # Avoid NaN for a constant column and return zeros instead
        if max_value == min_value:
            return df_col * 0

        sub_value = max_value - min_value
        return np.divide(np.subtract(df_col, min_value), sub_value)
    else:
        return df_col

df_normalize = df.apply(lambda x: normalize(x))
-1

This normalises a DataFrame column using only native Python. Different value ranges influence downstream processes, e.g. plot colours.

Between 0 and 1:

min_val = min(list(df['col']))
max_val = max(list(df['col']))
df['col'] = [(x - min_val) / (max_val - min_val) for x in df['col']]

So that the values sum to 1:

df['col'] = [float(i)/sum(df['col']) for i in df['col']]

Or, between -1 and 1 (dividing by the maximum absolute value):

df['col'] = [float(tp) / max(abs(df['col'])) for tp in df['col']]
DanielBell99
-6

If your data is positively skewed, one way to normalize it is to apply a log transformation:

import numpy as np

df = np.log10(df)
Adrian Mole
amit haldar