I am doing feature scaling on my data, and R and Python are giving me different results. The two disagree on several summary statistics of the same column:
Median:
NumPy gives 14.948499999999999 with this code: np.percentile(X[:, 0], 50, interpolation = 'midpoint'). The built-in statistics package in Python gives the same answer with the following code: statistics.median(X[:, 0]). On the other hand, R gives 14.9632 with this code: median(X[, 1]). Interestingly, the summary() function in R gives 14.960 as the median.
A similar difference occurs when computing the mean of this same data. R gives 13.10936 using the built-in mean() function, while both NumPy and the Python statistics package give 13.097945407088607.
Again, the same thing happens when computing the standard deviation. R gives 7.390328, and NumPy (with ddof = 1) gives 7.3927612774052083. With ddof = 0, NumPy gives 7.3927565984408936.
The IQR also differs. Using the built-in IQR() function in R, the result is 12.3468. Using NumPy with this code: np.percentile(X[:, 0], 75) - np.percentile(X[:, 0], 25), the result is 12.358700000000002.
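For reference, the Python-side computations described above can be gathered into one place. This is only a minimal sketch: the `summarize` helper and the tiny `X` array are made up for illustration and stand in for the real data, which is not shown here. It also uses `np.median` in place of `np.percentile(..., 50, interpolation = 'midpoint')`, since both return the midpoint of the two middle values and the keyword name differs between NumPy versions.

```python
import statistics

import numpy as np


def summarize(column):
    """Return the summary statistics discussed above for one column."""
    column = np.asarray(column, dtype=float)
    # Default np.percentile behaviour: linear interpolation between points.
    q25, q75 = np.percentile(column, [25, 75])
    return {
        "median_numpy": float(np.median(column)),
        "median_statistics": statistics.median(column.tolist()),
        "mean": float(np.mean(column)),
        "std_ddof1": float(np.std(column, ddof=1)),  # sample standard deviation
        "std_ddof0": float(np.std(column, ddof=0)),  # population standard deviation
        "iqr": float(q75 - q25),
    }


# Hypothetical stand-in for X; the real array has 795066 rows.
X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 30.0], [8.0, 40.0]])

stats = summarize(X[:, 0])
for name, value in stats.items():
    print(name, value)
```

Running the same column through R's median(), mean(), sd() and IQR() gives the other side of the comparison.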
What is going on here? Why are Python and R always giving different results? It may help to know that my data has 795066 rows and is being treated as an np.array() in Python. The same data is being treated as a matrix in R.
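Since the row count and container type are mentioned above, here is a small sketch of how those facts could be reported on the Python side so they can be compared with dim(X) and sum(is.na(X)) in R. The `describe_input` helper and the demo array are hypothetical; this only reports what was loaded and does not by itself explain the discrepancy.

```python
import numpy as np


def describe_input(X):
    """Report basic facts about the array as loaded."""
    X = np.asarray(X)
    info = {"shape": X.shape, "dtype": str(X.dtype)}
    # NaN counting only makes sense for floating-point data.
    if np.issubdtype(X.dtype, np.floating):
        info["nan_count"] = int(np.isnan(X).sum())
    return info


# Hypothetical demo array with one missing value.
X_demo = np.array([[1.0, np.nan], [2.0, 3.0]])
info = describe_input(X_demo)
print(info)
```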