This is a continuation of my previous post on normalizing columns of a Pandas DataFrame, with a specific condition for negative values.
The DataFrame I'm using is the following:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key'    : [111, 222, 333, 444, 555, 666, 777, 888, 999],
                   'score1' : [-1, 0, 2, -1, 7, 0, 15, 0, 1],
                   'score2' : [2, 2, -1, 10, 0, 5, -1, 1, 0]})
print(df)
   key  score1  score2
0  111      -1       2
1  222       0       2
2  333       2      -1
3  444      -1      10
4  555       7       0
5  666       0       5
6  777      15      -1
7  888       0       1
8  999       1       0
The possible values for the score1 and score2 Series are -1 and all non-negative integers (0 included). My goal was to normalize both columns the following way:
- If the value is equal to -1, return a missing NaN value.
- Else, scale the remaining non-negative integers to a range between 0 and 1 (a concrete sketch of the intended mapping follows this list).
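To make the target concrete, here is a minimal sketch of the intended mapping on a plain list, hard-coding 15 (the maximum of score1) purely for illustration:

# Sentinel -1 becomes NaN; everything else is divided by the column maximum
values = [-1, 0, 2, -1, 7, 0, 15, 0, 1]
normalized = [float('nan') if v == -1 else v / 15 for v in values]
print(normalized)
# [nan, 0.0, 0.1333..., nan, 0.4666..., 0.0, 1.0, 0.0, 0.0666...]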
I'm extremely happy with the solution from ezrael. That being said, I kept working on the problem to see if I could come up with an alternative solution. Here's my attempt:
- I'm defining the following function:
def normalize(x):
    if x == -1:
        return np.nan
    else:
        return x / x.max()
- I'm creating the new norm1 Series by applying the above function to the score1 Series:
df['norm1'] = df['score1'].apply(normalize)
Unfortunately, this raises the following error: AttributeError: 'int' object has no attribute 'max'.
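To double-check what apply actually hands to the function, I ran a quick probe (the exact type name printed may vary with the pandas version):

# Inspect the type of each element that apply passes to the callable
print(df['score1'].apply(lambda x: type(x).__name__).unique())
# On my setup this prints plain scalar type names such as 'int'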
I also converted the score1 Series to float64, but that does not fix the problem; the error simply becomes 'float' object has no attribute 'max'.
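For reference, the conversion was roughly the following:

# Cast score1 to float64, then re-apply the function;
# the same AttributeError appears, now naming 'float':
df['score1'] = df['score1'].astype('float64')
df['norm1'] = df['score1'].apply(normalize)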
I also did a quick test, replacing the second return statement with return x/15 (15 being the maximum value of the score1 Series), and it worked.
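For completeness, the modified function with the hard-coded divisor:

def normalize(x):
    if x == -1:
        return np.nan
    else:
        # Hard-coded maximum of the score1 Series
        return x / 15

df['norm1'] = df['score1'].apply(normalize)
print(df)

which produces: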
   key  score1  score2     norm1
0  111    -1.0       2       NaN
1  222     0.0       2  0.000000
2  333     2.0      -1  0.133333
3  444    -1.0      10       NaN
4  555     7.0       0  0.466667
5  666     0.0       5  0.000000
6  777    15.0      -1  1.000000
7  888     0.0       1  0.000000
8  999     1.0       0  0.066667
But this is not a viable solution: I want to divide by the maximum value of the Series instead of hard-coding it. WHY is my solution not working, and HOW do I fix my code?