
I have a large dataset and want to normalize every column so that each one starts at 100. I used the following code:

df.apply(lambda x: (x  / x.iloc[0])*100)

But some columns start with 0, so dividing by the first row returns nan (or inf). How can I change the code so that it divides by the first non-zero value in each column rather than the first row's value?

This is a sample of my DataFrame:

# Create a DataFrame from a dict of lists.
import pandas as pd
  
# initialise data of lists.
data = {'marksA':[99, 98, 95, 80, 98, 95, 85],
       'marksB':[0, 0, 95, 80, 98, 95, 85],
       'marksC':[89, 98, 95, 83, 98, 95, 85]}
  
# Creates pandas DataFrame.
df = pd.DataFrame(data, index =['2000/01/01', '2001/01/01', '2002/01/01', '2003/01/01', '2004/01/01', '2005/01/01', '2006/01/01'])
  
# print the data
df

            marksA  marksB  marksC
2000/01/01      99       0      89
2001/01/01      98       0      98
2002/01/01      95      95      95
2003/01/01      80      80      83
2004/01/01      98      98      98
2005/01/01      95      95      95
2006/01/01      85      85      85


normalization = df.apply(lambda x: (x  / x.iloc[0])*100)
normalization

            marksA  marksB  marksC
2000/01/01  100.00     nan  100.00
2001/01/01   98.99     nan  110.11
2002/01/01   95.96     inf  106.74
2003/01/01   80.81     inf   93.26
2004/01/01   98.99     inf  110.11
2005/01/01   95.96     inf  106.74
2006/01/01   85.86     inf   95.51
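
The nan and inf entries in marksB come from dividing by the zero in the first row: 0 / 0 gives nan, and any positive value divided by 0 gives inf. A minimal sketch that reproduces this on a shortened copy of the marksB column (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

# A shortened copy of the marksB column from the sample data.
marksB = pd.Series([0, 0, 95, 80], name="marksB")

# Same operation as in the lambda: divide by the first row, times 100.
result = marksB / marksB.iloc[0] * 100

# Rows holding 0 become nan (0 / 0); non-zero rows become inf (x / 0).
print(result)
```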

K saman
  • What should the expected output look like for column `marksB`? – jezrael Apr 21 '21 at 05:50
  • If you really want to normalize each column, why get the first non-zero value, instead of simply max() of that column? And in pandas, you can simply do `df.apply(max, axis=0)`. Hence: `100 * df / df.apply(max, axis=0)` – smci Apr 21 '21 at 06:05
  • A near-duplicate of [Normalize columns of pandas data frame](https://stackoverflow.com/questions/26414913/normalize-columns-of-pandas-data-frame) – smci Apr 21 '21 at 08:19

2 Answers


The idea is to replace 0 with missing values and then back-fill them with bfill, so that the first row holds each column's first non-zero value. apply is not necessary here, because you can divide the whole DataFrame by that first row:

import numpy as np

normalization = df.div(df.replace(0, np.nan).bfill().iloc[0]).mul(100)

print (normalization)
                marksA      marksB      marksC
2000/01/01  100.000000    0.000000  100.000000
2001/01/01   98.989899    0.000000  110.112360
2002/01/01   95.959596  100.000000  106.741573
2003/01/01   80.808081   84.210526   93.258427
2004/01/01   98.989899  103.157895  110.112360
2005/01/01   95.959596  100.000000  106.741573
2006/01/01   85.858586   89.473684   95.505618
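
To see why the one-liner works, the intermediate steps can be spelled out. This is a minimal, self-contained sketch on the sample data from the question; the names no_zeros, backfilled, and base are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "marksA": [99, 98, 95, 80, 98, 95, 85],
        "marksB": [0, 0, 95, 80, 98, 95, 85],
        "marksC": [89, 98, 95, 83, 98, 95, 85],
    },
    index=["2000/01/01", "2001/01/01", "2002/01/01", "2003/01/01",
           "2004/01/01", "2005/01/01", "2006/01/01"],
)

# Step 1: treat zeros as missing so they cannot be picked as the base.
no_zeros = df.replace(0, np.nan)

# Step 2: back-fill, so each leading NaN takes the next valid value below it.
backfilled = no_zeros.bfill()

# Step 3: the first row now holds each column's first non-zero value.
base = backfilled.iloc[0]

# Step 4: divide the original frame by that row (aligned on column names).
normalization = df.div(base).mul(100)

print(base)
```

Under this sketch, a column that contains only zeros has no base value and would come out as all NaN.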

Your code, adapted the same way:

normalization = df.apply(lambda x: (x  / x.replace(0, np.nan).bfill().iloc[0])*100)

As @smci pointed out, to normalize by each column's maximum in a vectorized way, use:

normalization = df.div(df.max()).mul(100)
print (normalization)
                marksA      marksB      marksC
2000/01/01  100.000000    0.000000   90.816327
2001/01/01   98.989899    0.000000  100.000000
2002/01/01   95.959596   96.938776   96.938776
2003/01/01   80.808081   81.632653   84.693878
2004/01/01   98.989899  100.000000  100.000000
2005/01/01   95.959596   96.938776   96.938776
2006/01/01   85.858586   86.734694   86.734694
jezrael
  • Thank you, it worked well for my original question. However, it raised a new question for me: I have to plot all of my columns as line plots, and since I normalized them to start from 100, the lines drop to zero where the leading values are zero. Could you advise me on this, if possible? – K saman Apr 21 '21 at 11:38
  • @Ksaman - Unfortunately, I only know some basics of plotting, so I have no idea, sorry. – jezrael Apr 21 '21 at 11:40
  • No problem. With normalization = df.div(df.replace(0, np.nan).bfill().iloc[0]).mul(100), can I use bfill to get the first two rows of "marksB" as 100 instead of 0? If I can change the zeros at the beginning of each column to 100, I will be super happy. – K saman Apr 21 '21 at 11:56
  • `normalization = normalization.replace([0], value=None, method='bfill')` should be the answer. – K saman Apr 21 '21 at 12:10
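
For the plotting follow-up in the comments above, one possible approach is to touch only the zeros that come before a column's first non-zero value. The sketch below is not from either answer: the names leading_zero, with_gaps, and flat_start are hypothetical, and it assumes that zeros appearing later in a column are real data that should be left alone.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "marksA": [99, 98, 95, 80, 98, 95, 85],
        "marksB": [0, 0, 95, 80, 98, 95, 85],
        "marksC": [89, 98, 95, 83, 98, 95, 85],
    },
    index=["2000/01/01", "2001/01/01", "2002/01/01", "2003/01/01",
           "2004/01/01", "2005/01/01", "2006/01/01"],
)

# Normalize by each column's first non-zero value, as in the answer above.
base = df.replace(0, np.nan).bfill().iloc[0]
normalization = df.div(base).mul(100)

# True for every cell that sits above the column's first non-zero value:
# the running count of non-zero cells is still 0 there.
leading_zero = df.ne(0).astype(int).cumsum().eq(0)

# Option 1: blank out the leading zeros, so a line starts at its first real value.
with_gaps = normalization.mask(leading_zero)

# Option 2: show the leading zeros as a flat line at 100.
flat_start = normalization.mask(leading_zero, 100.0)

print(flat_start)
```

Calling with_gaps.plot() (which needs matplotlib installed) should then draw marksB only from 2002/01/01 onward, while flat_start.plot() would start every line at 100.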

If you really want to normalize (strictly speaking, 'scale') each column, you should divide by that column's max(), rather than taking the first non-zero value and assuming the values are sorted in decreasing order.

In pandas, you can get each column's max with df.apply(max, axis=0), or more simply with df.max().

Hence your solution is:

100. * df / df.apply(max, axis=0)

                marksA      marksB      marksC
2000/01/01  100.000000    0.000000   90.816327
2001/01/01   98.989899    0.000000  100.000000
2002/01/01   95.959596   96.938776   96.938776
2003/01/01   80.808081   81.632653   84.693878
2004/01/01   98.989899  100.000000  100.000000
2005/01/01   95.959596   96.938776   96.938776
2006/01/01   85.858586   86.734694   86.734694

(Note that dividing a DataFrame by a Series with / (equivalently df.divide()) is vectorized and matches the Series index against the DataFrame's columns, so you don't need to declare a lambda.)
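
As a small check of that note, the sketch below rebuilds the sample data and compares the / operator with the explicit df.div(..., axis="columns") form. The variable names col_max, scaled, and same are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "marksA": [99, 98, 95, 80, 98, 95, 85],
        "marksB": [0, 0, 95, 80, 98, 95, 85],
        "marksC": [89, 98, 95, 83, 98, 95, 85],
    },
    index=["2000/01/01", "2001/01/01", "2002/01/01", "2003/01/01",
           "2004/01/01", "2005/01/01", "2006/01/01"],
)

# One maximum per column; the resulting Series is indexed by column name.
col_max = df.max()

# The / operator lines up col_max's index with df's columns.
scaled = 100.0 * df / col_max

# The explicit method form, spelling out the alignment axis.
same = df.div(col_max, axis="columns").mul(100)

# After scaling, every column's maximum should be 100.
print(scaled.max())
```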

PS: to suppress the unwanted decimal places, you could do e.g. pd.options.display.float_format = '{:,.2f}'.format

smci
  • @jezrael: stop misrepresenting me. I said the `/` is vectorized. Otherwise if it was a scalar its RHS operand couldn't be a vector, could it? – smci Apr 21 '21 at 08:12