
I am running only the following three lines:

df = pd.read_hdf('data.h5')
print(df.mean())
print(df['derived_3'].mean())

The first print lists all of the individual means for each column, with one of these being

derived_3        -5.046012e-01

The second print gives the mean of just this column alone and is giving the result

-0.504715

Even allowing for the difference between scientific and plain notation, these two values differ. Why is this?


Examples Using Other Methods

Performing the same with sum() results in the following:

derived_3        -7.878262e+05

-788004.0

Again, slightly different results, but count() returns the same results:

derived_3         1561285

1561285

Also, the result of df.head():

   id  timestamp  derived_0  derived_1  derived_2  derived_3  derived_4  \
0  10          0   0.370326  -0.006316   0.222831  -0.213030   0.729277   
1  11          0   0.014765  -0.038064  -0.017425   0.320652  -0.034134   
2  12          0  -0.010622  -0.050577   3.379575  -0.157525  -0.068550   
3  25          0        NaN        NaN        NaN        NaN        NaN   
4  26          0   0.176693  -0.025284  -0.057680   0.015100   0.180894   

   fundamental_0  fundamental_1  fundamental_2    ...     technical_36  \
0      -0.335633       0.113292       1.621238    ...         0.775208   
1       0.004413       0.114285      -0.210185    ...         0.025590   
2      -0.155937       1.219439      -0.764516    ...         0.151881   
3       0.178495            NaN      -0.007262    ...         1.035936   
4       0.139445      -0.125687      -0.018707    ...         0.630232   

   technical_37  technical_38  technical_39  technical_40  technical_41  \
0           NaN           NaN           NaN     -0.414776           NaN   
1           NaN           NaN           NaN     -0.273607           NaN   
2           NaN           NaN           NaN     -0.175710           NaN   
3           NaN           NaN           NaN     -0.211506           NaN   
4           NaN           NaN           NaN     -0.001957           NaN   

   technical_42  technical_43  technical_44         y  
0           NaN          -2.0           NaN -0.011753  
1           NaN          -2.0           NaN -0.001240  
2           NaN          -2.0           NaN -0.020940  
3           NaN          -2.0           NaN -0.015959  
4           NaN           0.0           NaN -0.007338  
piRSquared
KOB
  • Also, add `df.dtypes`? – Zero Oct 04 '17 at 19:34
  • Added to my post. It is a very large file, and as far as I know some of the numbers have ~20 decimal places, which aren't shown in results from pandas. Could this somehow be causing the problem? – KOB Oct 04 '17 at 19:34
  • Perhaps. Take a look at https://stackoverflow.com/questions/22107928/numpy-sum-is-not-giving-right-answer-for-float32-type and https://stackoverflow.com/questions/41705764/numpy-sum-giving-strange-results-on-large-arrays – Zero Oct 04 '17 at 19:35
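
The linked questions concern float32 precision. As a minimal sketch of that effect, assuming the column is stored as float32 (the post does not show df.dtypes, so this is only an assumption), the following uses made-up values rather than the data from the question:

```python
import numpy as np

# float32 has a 24-bit significand, so it cannot represent 2**24 + 1.
big = np.float32(16777216.0)                  # 2**24
same_in_f32 = (big + np.float32(1.0)) == big  # the added 1 is lost
same_in_f64 = (np.float64(16777216.0) + 1.0) == 16777216.0

print(same_in_f32)  # True
print(same_in_f64)  # False

# Summing many float32 values: the accumulator dtype can change the result.
# The array below is hypothetical illustration data.
values = np.full(1_000_000, 0.1, dtype=np.float32)
sum_f32 = values.sum()                  # accumulates in float32
sum_f64 = values.sum(dtype=np.float64)  # accumulates in float64
print(sum_f32, sum_f64)
```

If the two code paths in pandas accumulate differently for a float32 column, small discrepancies of this kind could appear in sum() and mean() while count() stays identical.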

1 Answer


pd.DataFrame method versus pd.Series method

In df.mean(), mean is pd.DataFrame.mean, which operates on each column as a separate pd.Series. It returns a pd.Series whose index is df.columns and whose values are the means of each column. In your example, the derived_3 entry of that Series is the mean of that one column.

In df['derived_3'].mean(), mean is pd.Series.mean and df['derived_3'] is a pd.Series. The result of pd.Series.mean will be a scalar.
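
A small sketch of the two return types, using a hypothetical two-column frame in place of the data from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame in the question.
df = pd.DataFrame({
    "derived_3": [-0.5, 0.25, np.nan, 1.0],
    "y": [0.1, 0.2, 0.3, 0.4],
})

col_means = df.mean()              # pd.DataFrame.mean -> pd.Series
one_mean = df["derived_3"].mean()  # pd.Series.mean -> scalar

print(type(col_means).__name__)  # Series
print(list(col_means.index))     # ['derived_3', 'y']
print(np.isscalar(one_mean))     # True

# Both routes skip NaN by default and agree on the value here.
print(np.isclose(col_means["derived_3"], one_mean))  # True
```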


Display Differences

The difference in display arises because the result of df.mean() is a pd.Series, whose float format is controlled by pandas. On the other hand, df['derived_3'].mean() returns a plain scalar, whose formatting isn't controlled by pandas.

import numpy as np
import pandas as pd

scalar

np.pi

3.141592653589793

pd.Series

pd.Series(np.pi)

0    3.141593
dtype: float64

with different formatting

with pd.option_context('display.float_format', '{:0.15f}'.format):
    print(pd.Series(np.pi))

0   3.141592653589793
dtype: float64

Reduction

It is useful to think of these various methods as either reducing the dimensionality (aggregation) or preserving it (transformation).

  • reducing a pd.DataFrame results in a pd.Series
  • reducing a pd.Series results in a scalar

Methods That Reduce

  • mean
  • sum
  • std
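
To illustrate the distinction, here is a short sketch on a made-up frame. It contrasts sum, which reduces, with cumsum, which I picked as an example of a method that keeps the shape (cumsum is not mentioned above):

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

reduced_df = df.sum()        # DataFrame -> Series, one value per column
reduced_col = df["a"].sum()  # Series -> scalar
kept_shape = df.cumsum()     # transformation: same shape as df

print(reduced_df.shape)  # (2,)
print(reduced_col)       # 6.0
print(kept_shape.shape)  # (3, 2)
```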
piRSquared
  • I understand. When you say "the difference in display", do you mean that the calculations are actually exactly correct both ways, and are just displayed differently, or if I interchanged my two examples when performing calculations, would this actually skew my results? – KOB Oct 04 '17 at 19:40
  • They are exactly the same. `3.14159265359` and the value inside `pd.Series(3.14159265359)` are the same. – piRSquared Oct 04 '17 at 19:44
  • @piRSquared One more question about this - I have this operation `df.ix[:, 2:-1] = df.ix[:, 2:-1] - df.ix[:, 2:-1].mean()`, which I am expecting to center all of the indexed columns so that their means are now 0. When I print out the means after executing this, they all display as very small numbers, but none as exactly 0. Is there any way I can check whether my equation is correct and the values actually are zero, or is my equation just wrong and they would display as 0 if it were right? – KOB Oct 04 '17 at 20:24
  • When you are in the world of floating points, there is no such thing as exact. That very small number is close enough to zero. You can use `np.isclose` to determine whether floating point numbers are close, within some tolerance. You can use `round` to make them equal to zero if you'd like. – piRSquared Oct 04 '17 at 20:27
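
The check suggested in the last comment could be sketched as follows. The frame is random, hypothetical data, and the sketch uses `.iloc` where the comment used `.ix`, since `.ix` was removed in pandas 1.0. The slice is `1:-1` here only because this toy frame has a single leading column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one leading id-like column, two features, one target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["id", "a", "b", "y"])

# Center the middle columns so each has mean (approximately) zero.
df.iloc[:, 1:-1] = df.iloc[:, 1:-1] - df.iloc[:, 1:-1].mean()

means = df.iloc[:, 1:-1].mean()

# The means are tiny but usually not exactly 0.0; compare with a tolerance.
all_close = np.isclose(means, 0.0).all()
print(all_close)  # True

# Rounding is an alternative if zeros are wanted for display.
print(means.round(10))
```

`np.isclose` uses default tolerances (`rtol=1e-05`, `atol=1e-08`), which can be tightened or loosened as needed.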