
This is the continuation of my previous post on normalizing columns of a Pandas DataFrame with a specific condition for negative values.

The DataFrame I'm using is the following:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key' : [111, 222, 333, 444, 555, 666, 777, 888, 999],
                   'score1' : [-1, 0, 2, -1, 7, 0, 15, 0, 1], 
                   'score2' : [2, 2, -1, 10, 0, 5, -1, 1, 0]})

print(df)

   key  score1  score2
0  111      -1       2
1  222       0       2
2  333       2      -1
3  444      -1      10
4  555       7       0
5  666       0       5
6  777      15      -1
7  888       0       1
8  999       1       0

The possible values in the score1 and score2 Series are -1 and all non-negative integers. My goal was to normalize both columns the following way:

  • If the value is equal to -1, return NaN (a missing value)
  • Otherwise, scale the remaining non-negative integers to a range between 0 and 1.

I'm extremely happy with the solution from ezrael. That being said, I continued working on my problem to see if I could come up with an alternative solution. Here's my attempt:

  1. I'm defining the following function:

def normalize(x):
    if x == -1:
        return np.nan
    else:
        return x/x.max()

  2. I'm creating the new norm1 Series by applying the above function to the score1 Series:

df['norm1'] = df['score1'].apply(normalize)

Unfortunately, this raises the following AttributeError: 'int' object has no attribute 'max'.

I converted the score1 Series to float64, but that did not fix the problem: 'float' object has no attribute 'max'.

I also did a quick test by replacing the second `return` statement with `return x/15` (15 being the maximum value of the score1 Series), and it worked:

   key  score1  score2     norm1
0  111    -1.0       2       NaN
1  222     0.0       2  0.000000
2  333     2.0      -1  0.133333
3  444    -1.0      10       NaN
4  555     7.0       0  0.466667
5  666     0.0       5  0.000000
6  777    15.0      -1  1.000000
7  888     0.0       1  0.000000
8  999     1.0       0  0.066667

But this is not a viable solution. I want to be able to divide by the maximum value of the Series instead of hard-coding it. WHY is my solution not working and HOW do I fix my code?

glpsx
  • Your function is being called once for each element of the Series. It has no (direct) access to the entire Series, which would be required to call `.max()` on it. – jasonharper Sep 09 '19 at 12:18
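A quick way to see for yourself what the comment describes, as a minimal sketch (the Series s and the helper show are just illustrative names):

s = pd.Series([3, 7, 15])

def show(x):
    # apply() hands the function one element at a time,
    # never the whole Series, so x has no .max() method
    print(x)
    return x

s.apply(show)  # prints 3, then 7, then 15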

3 Answers


The reason for the AttributeError: 'float' object has no attribute 'max' error is that your code calls the max() function on each individual (float) item of your column. Instead, you can pass the max value of the column into the normalize function:

def normalize(x, col_max):
    if x == -1:
        return np.nan
    else:
        return x/col_max

And edit the norm1 column creation code as follows:

df['norm1'] = df['score1'].apply(lambda x: normalize(x, df['score1'].max()))
FabioL
  • @FabioL thank you! One should compute the max just once (df['score1'].max()) instead of once per row; otherwise there is a large performance penalty for large DataFrames. – blue-sky Jul 02 '20 at 21:26
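Following up on that comment, a minimal sketch of the same approach with the maximum hoisted out of the lambda so it is evaluated a single time (col_max is just an illustrative name):

col_max = df['score1'].max()  # computed once, not once per row
df['norm1'] = df['score1'].apply(lambda x: normalize(x, col_max))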

Another solution, using a function that takes a series as the input rather than a scalar:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key' : [111, 222, 333, 444, 555, 666, 777, 888, 999],
                   'score1' : [-1, 0, 2, -1, 7, 0, 15, 0, 1],
                   'score2' : [2, 2, -1, 10, 0, 5, -1, 1, 0]})

df['norm1'] = df['score1'].replace(-1, np.nan)  # -1 becomes NaN


def normalize_series(s):
    # min-max scaling; pandas min() and max() skip NaN by default
    return (s - s.min()) / (s.max() - s.min())


df['norm1'] = normalize_series(df['norm1'])
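Incidentally, on this data the min-max scaling reproduces the question's expected norm1 column: after the replace, the minimum of the remaining values is 0, so (s - s.min()) / (s.max() - s.min()) reduces to s / s.max().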

As already mentioned, your version isn't working because you are trying to find the max of a single number, not a series.
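For comparison, a fully vectorized sketch of the question's x / max normalization: Series.where() keeps the values where the condition holds and fills the rest with NaN, and pandas' max() skips NaN by default (s is again just an illustrative name):

s = df['score1'].where(df['score1'] != -1)  # -1 -> NaN, other values kept
df['norm1'] = s / s.max()                   # max() ignores the NaNs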

Dan

It is important to understand what the apply function does: the x argument of apply is a whole row or column (if you apply a function to a pd.DataFrame object) or a single scalar value (if you are applying it to a pd.Series object).

You are in the second case. Imagine that, instead of a pd.Series, you had a list.

L = [1, 2, 3, 4, 5]

def normalize(x):
    return x / max(x)

# apply() calls the function once per element, so this amounts to:
[normalize(x) for x in L]  # TypeError: 'int' object is not iterable

Here it is clear that max(x) does not make any sense, because each x is a single number. What you are looking for is max(L).

So this would be technically okay:

L = [1, 2, 3, 4, 5]

def normalize(x):
    return x / max(L)

[normalize(x) for x in L]

But not very efficient, since you are recomputing max(L) at every iteration. So

L = [1, 2, 3, 4, 5]
max_L = max(L)  # computed once

def normalize(x, max_L):
    return x / max_L

[normalize(x, max_L) for x in L]

would be the answer you are looking for. With a pd.Series, this gives:

col_max = df['score1'].max()  # computed once, not once per row

def normalize(x, col_max):
    if x == -1:
        return np.nan
    else:
        return x / col_max

df['norm1'] = df['score1'].apply(lambda x: normalize(x, col_max))

Note that it is not necessary to get rid of the NaNs before computing the minimum and maximum: np.nanmin() and np.nanmax() simply ignore them. You can separate the two operations like this:

def create_nans(x):
    if x == -1:
        return np.nan
    else:
        return x

def normalize(x, col_max):
    return x / col_max  # make sure col_max is neither 0 nor NaN

df['score1'] = df['score1'].apply(create_nans)
col_max = np.nanmax(df['score1'])  # np.nanmax: a Series has no .nanmax() method
df['norm1'] = df['score1'].apply(lambda x: normalize(x, col_max))
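With the sample DataFrame from the question, this produces the same norm1 column as the hard-coded x/15 test above: NaN for the -1 rows and x/15 for the rest, since the column maximum is 15.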
Doe Jowns