
I'm new to Python and the NumPy library. I'm doing PCA on my custom dataset. I calculate the mean of each row of my pandas DataFrame, but I get the following result as the mean array:

[   7.433148e+46
    7.433148e+47
    7.433148e+47
    7.433148e+46
    7.433148e+46
    7.433148e+46
    7.433148e+46
    7.433148e+45
    7.433148e+47]

And my code is:

    np.set_printoptions(precision=6)
    np.set_printoptions(suppress=False)
    df['mean'] = df.mean(axis=1)
    mean_vector = np.array(df.iloc[:, 15], dtype=np.float64)

    print('Mean Vector:\n', mean_vector)

What is the meaning of these numbers, and how can I remove the e from them?

Any help is really appreciated. Thanks in advance.

hpaulj
Elmira Frhn
  • `7.433148e+47` is `7.433148* (10^47)`, is this what you mean? What if you set `np.set_printoptions(suppress=True)`? – Mahdi Jan 14 '17 at 20:12
  • No, I know the value. I used np.set_printoptions(suppress=True) to remove the e from the numbers, as the numpy documentation says (https://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html), but it didn't work. – Elmira Frhn Jan 14 '17 at 20:15
  • Well, you might want to look at this question and its answers: http://stackoverflow.com/questions/9777783/suppress-scientific-notation-in-numpy-when-creating-array-from-nested-list – Mahdi Jan 14 '17 at 20:17
  • Thanks, I had checked it before and then used np.set_printoptions(suppress=True) – Elmira Frhn Jan 14 '17 at 20:19
  • See [this post](http://stackoverflow.com/a/41645474/2336654) – piRSquared Jan 14 '17 at 20:19
  • Thank you a lot! It worked! @piRSquared – Elmira Frhn Jan 14 '17 at 20:22
  • I'm reopening this, because I don't think the problem is simply a formatting one. Something is fishy about a mean of 4e46 size. – hpaulj Jan 15 '17 at 07:41

1 Answer


Are these large numbers realistic, and if so, how do you want to display them?

Copy and paste from your question:

In [1]: x=np.array([7.433148e+46,7.433148e+47])

The default numpy display adds a few decimal places:

In [2]: x
Out[2]: array([  7.43314800e+46,   7.43314800e+47])

Changing precision doesn't change much:

In [5]: np.set_printoptions(precision=6)
In [6]: np.set_printoptions(suppress=True)

In [7]: x
Out[7]: array([  7.433148e+46,   7.433148e+47])

suppress does less. It suppresses scientific notation for small floating point values, not large ones. From the documentation:

suppress : bool, optional
Whether or not suppress printing of small floating point values using       
scientific notation (default False).

The default Python display for one of these numbers is also scientific:

In [8]: x[0]
Out[8]: 7.4331480000000002e+46

With a formatting command we can display it in its 46+ character glory (or gory detail):

In [9]: '%f'%x[0]
Out[9]: '74331480000000001782664341808476383296708673536.000000'

If that was a real value I'd prefer to see the scientific notation.

In [11]: '%.6g'%x[0]
Out[11]: '7.43315e+46'
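
A few other ways to format a single value, offered as a sketch rather than a recommendation. `format` and f-strings are plain Python; `np.format_float_positional` is a NumPy helper for exponent-free text, and I'm assuming a NumPy version recent enough to have it (1.14 or later). The name `x0` is just an illustration.

```python
import numpy as np

x0 = 7.433148e+46  # one of the values from the question

sci_g = format(x0, '.6g')   # same result as '%.6g' % x0
sci_e = f'{x0:.3e}'         # scientific notation, 3 digits after the point
positional = np.format_float_positional(x0)  # digits only, no exponent

print(sci_g)
print(sci_e)
print(positional)
```

The positional form answers "how do I remove the e", but for a number this size it is mostly padding zeros, which is why the scientific form is usually easier to read.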

To illustrate what suppress does, print the inverse of this array:

In [12]: 1/x
Out[12]: array([ 0.,  0.])

In [13]: np.set_printoptions(suppress=False)

In [14]: 1/x
Out[14]: array([  1.345325e-47,   1.345325e-48])
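
Since suppress does nothing for large values, the way I know of to force a whole array into fixed-point text is a custom formatter. This is a minimal sketch, assuming a NumPy with `np.printoptions` (1.15 or later); the format strings are my choice, not something from the question.

```python
import numpy as np

x = np.array([7.433148e+46, 7.433148e+47])

# One-off conversion: every float goes through the '%.1f' formatter,
# so the resulting text has no exponent at all.
fixed = np.array2string(x, formatter={'float_kind': lambda v: '%.1f' % v})
print(fixed)

# Temporary print options: the formatter applies only inside the block,
# here giving a shorter scientific form instead.
with np.printoptions(formatter={'float_kind': lambda v: '%.3e' % v}):
    short = str(x)
print(short)
```

The context manager restores the previous print options on exit, so it avoids the global side effects of `np.set_printoptions`.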

===============

I'm not that familiar with pandas, but I wonder if your mean calculation makes sense. What does pandas print for df.iloc[:,15]? For the mean to be this large, the original data has to have values of similar size. How does the source display them? I wonder if most of your values are smaller, normal values, and you have a few excessively large ones (outliers) that 'distort' the mean.
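
One way to check for such outliers, as a rough sketch on made-up data. The frame, the column names and the `1e12` threshold are all hypothetical; only `mean`, `max`, `median` and `any` with `axis=1` are standard pandas calls.

```python
import pandas as pd

# Hypothetical data: ordinary values plus one corrupt, huge entry.
df = pd.DataFrame({
    'a': [83.0, 403.0, 8202.0],
    'b': [166.0, 1e48, 189.0],
    'c': [29.0, 4315.0, 183.0],
})

row_mean = df.mean(axis=1)      # axis=1: one mean per row, taken across columns
row_max = df.max(axis=1)        # largest value in each row
row_median = df.median(axis=1)  # much less sensitive to a single outlier

# Rows containing any value above an arbitrary (assumed) threshold.
suspects = df[(df.abs() > 1e12).any(axis=1)]

print(pd.DataFrame({'mean': row_mean, 'max': row_max, 'median': row_median}))
print(suspects)
```

A row whose mean and max are both enormous while its median stays ordinary points at a handful of bad entries rather than genuinely large data.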

I think you can simplify the array extraction with .values, replacing the first line below with the second:

mean_vector = np.array(df.iloc[:,15],dtype=np.float64)
mean_vector = df.iloc[:,15].values
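
A small check that the two forms agree, on a made-up frame (column names and values are hypothetical). I've also included `to_numpy`, which I believe is the spelling newer pandas versions (0.24 and later) prefer over `.values`; treat that as an assumption about your installed version.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the 'mean' column ends up at position 2.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})
df['mean'] = df.mean(axis=1)

via_array = np.array(df.iloc[:, 2], dtype=np.float64)   # form used in the question
via_values = df.iloc[:, 2].values                       # simpler extraction
via_to_numpy = df['mean'].to_numpy(dtype=np.float64)    # by column name

print(via_array, via_values, via_to_numpy)
```

Selecting by name (`df['mean']`) rather than by position also guards against picking the wrong column if the number of columns changes.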
hpaulj
  • Thanks for your reply. No, my numbers aren't large! I don't know how pandas calculated these mean values. As I read in the documentation, df.mean(axis=1) calculates the mean for columns, and this is the data from one of my columns: 83,403,8202,166,151712,189,29,4315,183,1234,2065,2016,4407129,,161096,4570,40636,43132,56822 – Elmira Frhn Jan 15 '17 at 06:59
  • There must be some value, maybe several, that isn't what you expect - something much larger. Try `max` over rows or columns. – hpaulj Jan 15 '17 at 07:39
  • Thank you so much! Exactly, I had a lot of repetitive numbers. I have now normalized my dataframe using preprocessing.MinMaxScaler() from the sklearn library and then calculated the mean on columns, and now I have mean values as below: 0.793219,0.799823,0.540168,0.074821,...,0.039899 – Elmira Frhn Jan 15 '17 at 08:53