I have a .csv file:

20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1

I am normalizing it using a pandas DataFrame, but I get missing values in the output .csv file:

.703280701968,0.867283950617,,,,0.0971635485818,-0.132770066385,,0.318518516666,-inf,-0.742913580247,-0.74703196347,-0.779350940252,-0.659592176966,-0.483438485804,0.565758716954,,,-inf,-0.274046377081,0.705774765311,-0.281481481478,-0.596841230258,,,1
0.104027493068,-0.0493827160494,,,,0.0199155099578,-0.0175015087508,,0.318518516666,-inf,-0.401580246914,-0.392694063927,-0.331530968381,-0.401165210674,-0.337539432177,0.426956186355,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.2494467914,0.254878294116,,0.318518516666,-inf,-0.0620246913541,-0.0547945205479,0.00470906912955,0.0370370365169,-0.183753943218,0.0159880797389,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.281231140616,0.286662643331,,0.318518516666,-inf,-0.0229135802474,-0.0164383561644,0.0392144605923,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.566083283042,0.571514785757,,0.318518516666,-inf,0.201086419753,0.199086757991,0.184362139917,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1

My code:

import pandas as pd


df = pd.read_csv('pooja.csv',index_col=False)
df_norm = (df.ix[:, 1:-1] - df.ix[:, 1:-1].mean()) / (df.ix[:, 1:-1].max() - df.ix[:, 1:-1].min())
rslt =  pd.concat([df_norm, df.ix[:,-1]], axis=1)
rslt.to_csv('example.csv',index=False,header=False)

What's wrong with the code? Why are values missing in the .csv file?

Ankit G.
  • Why don't you print out the dataframes at each step of your code? That way you can identify which line is responsible. – Spinor8 Mar 06 '16 at 11:09
  • I printed df_norm and it shows 'nan' for all the missing values, but why? – Ankit G. Mar 06 '16 at 11:14
  • So the rest is irrelevant. NaN gets converted into an empty field when you write it to a csv file. I don't have access to your csv data file, but I suspect your denominator might be zero. Why don't you split your df_norm calculation into two parts, df_numerator and df_denominator, and check whether df_denominator is zero for your data file? – Spinor8 Mar 06 '16 at 11:39

1 Answer

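You get many NaN values because you divide 0 by 0. In a column where every value is the same, the numerator (value minus column mean) is 0 and the denominator (column max minus column min) is 0 as well. See broadcasting behaviour. A better explanation is here.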
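
The division behaviour can be checked in isolation. The following is a minimal sketch for Python 3 (the Series names and values are made up for illustration): floating-point division by zero in pandas does not raise an error, 0 / 0 yields NaN, and a nonzero value divided by 0 yields inf or -inf depending on its sign. A tiny rounding error in the mean of a constant column is one plausible source of the -inf entries in the question's output, though that is an assumption about the data rather than something verified here.

```python
import numpy as np
import pandas as pd

# Illustrative numerators: exactly zero, slightly negative, slightly positive
numerator = pd.Series([0.0, -1e-16, 1e-16])
# For a constant column, max - min is 0
denominator = pd.Series([0.0, 0.0, 0.0])

result = numerator / denominator
print(result.tolist())  # [nan, -inf, inf]
```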

I use the code from your previous question, because I think slicing with df.ix[:, 1:-1] is not necessary. When I normalize with that slicing, I get an empty DataFrame.
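
If the last column is a label that should stay untouched, positional slicing with .iloc is one option; note that .ix was removed in pandas 1.0. The sketch below assumes a tiny made-up frame with two feature columns and one label column, so it is an illustration rather than the code from the question.

```python
import pandas as pd

# Tiny made-up frame: two feature columns (0, 1) and a label column (2)
df = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [10.0, 20.0, 40.0], 2: [1, 1, 1]})

features = df.iloc[:, :-1]  # every column except the last
labels = df.iloc[:, -1]     # the last column only

# Mean-center each feature column and scale by its range
features_norm = (features - features.mean()) / (features.max() - features.min())

# Put the untouched label column back on the right
result = pd.concat([features_norm, labels], axis=1)
print(result)
```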

import pandas as pd
import numpy as np
import io

temp=u"""20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), index_col=None, header=None)
# keep only the first 5 columns for testing
df = df.iloc[:, :5]
print(df)
           0          1     2  3  4
0  20376.650  22398.290  4.80  0  1
1   7048.842   8421.754  1.44  0  1
2   3716.890   4927.620  0.12  0  1
3   3716.890   4927.620  0.00  0  1
4   3716.890   4927.620  0.00  0  1

# max value of each column
print(df.max())
0    20376.65
1    22398.29
2        4.80
3        0.00
4        1.00
dtype: float64

# min value of each column
print(df.min())
0    3716.89
1    4927.62
2       0.00
3       0.00
4       1.00
dtype: float64
# difference max - min: it is 0 for columns 3 and 4
print(df.max() - df.min())
0    16659.76
1    17470.67
2        4.80
3        0.00
4        0.00
dtype: float64

# numerator: each value minus its column mean
print(df - df.mean())
            0           1      2  3  4
0  12661.4176  13277.7092  3.528  0  0
1   -666.3904   -698.8268  0.168  0  0
2  -3998.3424  -4192.9608 -1.152  0  0
3  -3998.3424  -4192.9608 -1.272  0  0
4  -3998.3424  -4192.9608 -1.272  0  0

# you get NaN because columns 3 and 4 of the numerator are all 0 and
# entries 3 and 4 of the difference are 0 too, so the division is 0 / 0
df_norm = (df - df.mean()) / (df.max() - df.min())
print(df_norm)
      0     1      2   3   4
0  0.76  0.76  0.735 NaN NaN
1 -0.04 -0.04  0.035 NaN NaN
2 -0.24 -0.24 -0.240 NaN NaN
3 -0.24 -0.24 -0.265 NaN NaN
4 -0.24 -0.24 -0.265 NaN NaN

Finally, when you write the result with to_csv, each NaN becomes an empty field, because the parameter na_rep has the default value "":

print(df_norm.to_csv(index=False, header=False, na_rep=""))
0.76,0.76,0.735,,
-0.04,-0.04,0.035,,
-0.24,-0.24,-0.24,,
-0.24,-0.24,-0.265,,
-0.24,-0.24,-0.265,,

If you change the value of na_rep, the missing values are written with that marker instead:

# change na_rep to * for testing
print(df_norm.to_csv(index=False, header=False, na_rep="*"))
0.76,0.76,0.735,*,*
-0.04,-0.04,0.035,*,*
-0.24,-0.24,-0.24,*,*
-0.24,-0.24,-0.265,*,*
-0.24,-0.24,-0.265,*,*
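
One possible way to avoid the NaN values altogether is sketched below for Python 3. It assumes that constant columns should simply come out as 0 after normalization; that choice and the helper name normalize_columns are illustrative assumptions, not part of the original code. The idea is to replace a zero range with 1 before dividing.

```python
import io

import pandas as pd


def normalize_columns(df):
    """Mean-center each column and scale it by its range (max - min).

    Hypothetical helper: a column whose range is 0 (a constant column)
    is divided by 1 instead, so it comes out as 0 rather than NaN.
    """
    value_range = df.max() - df.min()
    # Replace a zero range with 1 to avoid dividing by zero
    safe_range = value_range.replace(0, 1)
    return (df - df.mean()) / safe_range


# First five columns of the question's data; columns 3 and 4 are constant
sample = """20376.65,22398.29,4.8,0.0,1.0
7048.842,8421.754,1.44,0.0,1.0
3716.89,4927.62,0.12,0.0,1.0
3716.89,4927.62,0.0,0.0,1.0
3716.89,4927.62,0.0,0.0,1.0"""

# header=None: the data has no header row, so the first line is data
df = pd.read_csv(io.StringIO(sample), header=None)
df_norm = normalize_columns(df)
print(df_norm.to_csv(index=False, header=False))
```

Dropping the constant columns before normalizing would be another reasonable choice, since a column with a single repeated value carries no information to scale.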
jezrael