
I get completely different results from the same datasets in R and Python, and I cannot understand why this happens.

R:

library(RcppCNPy)

# Load the dependent and independent variables from the .npy files
d <- npyLoad("/home/vvkovalchuk/bin/src/python/asks1.npy")
datas <- npyLoad("/home/vvkovalchuk/bin/src/python/bids2.npy")

m <- lm(d ~ datas)
summary(m)

Python:

import numpy
import statsmodels.api as sm

# Load the same .npy files used in the R script
Y = numpy.load('./asks1.npy', allow_pickle=True)
X = numpy.load('./bids2.npy', allow_pickle=True)

# Add an intercept column and fit OLS, mirroring lm(d ~ datas)
X3 = sm.add_constant(X)
res_ols = sm.OLS(Y, X3).fit()

print(res_ols.params)

What am I doing wrong?
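
One quick sanity check, echoed in the comments below, is to confirm that R and Python actually see the same numbers after loading the .npy files. Here is a minimal sketch in Python, reusing the file names from above; the corresponding checks in R would be length(), head() and sum() on d and datas:

import numpy

# Sanity check: confirm both scripts receive identical data
Y = numpy.load('./asks1.npy', allow_pickle=True)
X = numpy.load('./bids2.npy', allow_pickle=True)

print(Y.shape, Y.dtype)   # expect (14641,) and a float dtype
print(X.shape, X.dtype)
print(Y[:5], X[:5])       # compare with head(d), head(datas) in R
print(Y.sum(), X.sum())   # quick checksums; compare with sum(d), sum(datas) in R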

Results:

R:

Call:
lm(formula = d ~ datas)

Residuals:
       Min         1Q     Median         3Q        Max 
-6.089e+06  8.797e+07  2.163e+08  2.179e+08  1.122e+10 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.561e+00  2.253e+06       0        1
datas        3.809e+03  2.164e+09       0        1

Residual standard error: 208100000 on 14639 degrees of freedom
Multiple R-squared:  0.2735,    Adjusted R-squared:  0.2735 
F-statistic:  5512 on 1 and 14639 DF,  p-value: < 2.2e-16

Python:

OLS Regression Results                            
Dep. Variable:                      y   R-squared:                  0.112
Model:                            OLS   Adj. R-squared:             0.112
Method:                 Least Squares   F-statistic:                 1846.
Date:                Thu, 25 Mar 2021   Prob (F-statistic):          0.00
Time:                        13:08:43   Log-Likelihood:         1.6948e+05
No. Observations:               14641   AIC:                    -3.390e+05
Df Residuals:                   14639   BIC:                    -3.389e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         


                coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         0.0004   3.07e-06    126.136      0.000       0.000       0.000
x1            0.1478      0.003     42.969      0.000       0.141       0.155

Omnibus:                     3251.130   Durbin-Watson:          0.004
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       14606.605
Skew:                           1.019   Prob(JB):               0.00
Kurtosis:                       7.449   Cond. No.               1.83e+05

I also tried swapping the arguments in the OLS call, but the results still don't match. Could this be related to NAs?
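
On the NA question: statsmodels' OLS does not handle missing values by default (missing='none'), while R's lm silently drops incomplete rows through its default na.action = na.omit, so NaNs in the arrays would make the two regressions use different rows. A minimal check along those lines, assuming the arrays are plain numeric:

import numpy
import statsmodels.api as sm

Y = numpy.load('./asks1.npy', allow_pickle=True).astype(float)
X = numpy.load('./bids2.npy', allow_pickle=True).astype(float)

# Count missing values in each array
print(numpy.isnan(Y).sum(), numpy.isnan(X).sum())

# Refit while dropping incomplete rows, mirroring lm's default na.omit behaviour
res = sm.OLS(Y, sm.add_constant(X), missing='drop').fit()
print(res.params)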

  • You need to find the equivalent of `sm.add_constant` when using `lm`. If you try `sm.OLS(Y, X)` only, do you get the same results as with `lm`? – Trusky Mar 25 '21 at 01:15
  • I need to use add_constant to add intercept. No, I got different results. – Vladimir Kovalchuk Mar 25 '21 at 01:20
  • Possibly a float/double difference, perhaps [here](https://github.com/eddelbuettel/rcppcnpy/issues/21) ... but I'm just taking a punt as I can't reproduce your example. – user20650 Mar 25 '21 at 01:25
  • @VladimirKovalchuk Try then `lm(d ~ 1 + datas)` and let me know if it gets the same result, please include a reproducible example to better check if not. – Trusky Mar 25 '21 at 01:27
  • So what are the results you are getting here? We can't see the output nor run the code. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. – MrFlick Mar 25 '21 at 03:08
  • The standard error in R points to an issue. Have you compared the data after loading it into R to the python data ... is it the same? – user20650 Mar 25 '21 at 12:24
  • @user20650 I tried saving in txt and numpy format, with no success. If I retrieve only the first 4096 observations from the dataset, the results are the same in R and Python; with 4097 observations, the results differ. – Vladimir Kovalchuk Mar 25 '21 at 22:06
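
Following up on the float/double suggestion in the comments: if the .npy files are stored as float32, RcppCNPy may not read them the same way NumPy does. A possible sketch that checks the dtypes and re-saves the data as 64-bit floats (the *_f64.npy file names are made up for illustration; the R script would then need to load these files instead):

import numpy

Y = numpy.load('./asks1.npy', allow_pickle=True)
X = numpy.load('./bids2.npy', allow_pickle=True)
print(Y.dtype, X.dtype)   # if these report float32, the R loader may disagree with NumPy

# Re-save as 64-bit floats and point the R script at the new files
numpy.save('./asks1_f64.npy', Y.astype(numpy.float64))
numpy.save('./bids2_f64.npy', X.astype(numpy.float64))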

0 Answers