3

First I used the R implementation of quantile regression, and then the scikit-learn implementation with the same quantile (tau) and alpha=0.0 (the regularization constant). Both give the same formula! I tried several solvers, but the running time is still much longer than in R.
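
For context, both implementations fit the line by minimizing the pinball (quantile) loss for the chosen tau, and with alpha=0.0 there is no penalty term, which is why the formulas should agree. A minimal pure-Python sketch of that loss follows; `pinball_loss` and `best_constant` are hypothetical helper names used only for illustration and are not part of quantreg or scikit-learn.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], tau: float) -> float:
    """Mean pinball loss: under-predictions cost tau, over-predictions cost (1 - tau)."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        residual = yt - yp
        if residual >= 0:
            total += tau * residual          # prediction too low
        else:
            total += (tau - 1.0) * residual  # prediction too high (term is positive)
    return total / len(y_true)


def best_constant(y: Sequence[float], tau: float) -> float:
    """Return the sample value that minimizes the pinball loss as a constant prediction."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), tau))


values = [float(v) for v in range(1, 12)]  # 1.0, 2.0, ..., 11.0
print("tau=0.5 ->", best_constant(values, tau=0.5))
print("tau=0.9 ->", best_constant(values, tau=0.9))
```

With an intercept-only model the minimizer is a sample quantile, so a larger tau pushes the fitted value toward the upper end of the data.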

Running time: Scikit-learn model vs R model

For example:

Example: 40672 samples

In the R model the default method is "br", and in sklearn it is "lasso". Even after I changed the method of the R implementation to "lasso", R's running time was still shorter.

Different methods

Imports and data creation:

import sklearn
print('sklearn version:', sklearn.__version__) # sklearn=1.0.1
import scipy
print('scipy version:', scipy.__version__) # scipy=1.7.2
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time

from sklearn.linear_model import QuantileRegressor

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import r2_score
from sklearn.ensemble import BaggingRegressor
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri, pandas2ri

pandas2ri.activate() #activate conversion of Python pandas to R data structures
numpy2ri.activate() #activate conversion of Python numpy to R data structures

n_samples, n_features = 10000, 1
X = np.linspace(start=0.0,stop=2.0,num=n_samples).reshape((n_samples,n_features))
y = X+X*np.random.rand(n_samples,n_features)+1

X = pd.DataFrame(data=X, columns=['X'])
y = pd.DataFrame(data=y, columns=['y'])

Function for plotting the data (with or without a line):

from typing import List, Optional
import matplotlib.lines as mlines

def ScatterPlot(X: np.ndarray, y: np.ndarray, title: str = "Default",
                line_coef: Optional[List[float]] = None) -> None:
    """Scatter-plot y against X; optionally draw the line line_coef = [intercept, slope]."""
    print(line_coef)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(X, y, s=80, marker="P", c='green')
    xmin, xmax = ax.get_xbound()
    ymin, ymax = ax.get_ybound()
    plt.title(title)
    plt.xlabel("X")
    plt.ylabel("Y")
    ax.set(xlim=(xmin, xmax), ylim=(ymin, ymax))
    ax.grid()
    if line_coef is not None:
        # Two points on the line: (0, intercept) and (1, intercept + slope)
        p1, p2 = [0, line_coef[0]], [1, sum(line_coef)]
        # Extend the line to the left and right edges of the plot
        ymax = p1[1] + (p2[1] - p1[1]) / (p2[0] - p1[0]) * (xmax - p1[0])
        ymin = p1[1] + (p2[1] - p1[1]) / (p2[0] - p1[0]) * (xmin - p1[0])
        ax.add_line(mlines.Line2D([xmin, xmax], [ymin, ymax], color='red'))
    plt.show()

ScatterPlot(X=X, y=y)

Plot

Functions for getting the formulas:

def R_get_formula():
    return (str(coef_R[0]) + ' + ' + ' + '.join(
        ['{} * [{}]'.format(str(a), str(b)) for a, b in zip(coef_R[1:].tolist(), ['X'])]))    

def get_formula_from_sklearn(regressor):
    return (str(regressor.intercept_) + ' + ' + ' + '.join(
            ['{} * [{}]'.format(str(a), str(b)) for a, b in zip(regressor.coef_.tolist(), regressor.feature_names_in_)])) 

Fit the data and test the running time and the formulas:

tau=0.95

_quantreg = importr("quantreg")  #import quantreg package from R
################# QuantileRegression R #################
start = time.time()
model_R = _quantreg.rq(formula='{} ~ .'.format(y.columns[0]), tau=tau, data=pd.concat(
            [y.reset_index(drop=True), X.loc[y.index, :].reset_index(drop=True)], axis=1))
coef_R = numpy2ri.ri2py(model_R[0])  # rpy2 2.x name; in rpy2 >= 3.0 the equivalent is numpy2ri.rpy2py
print('R took {} seconds to finish'.format(time.time()-start))
print("The formula is: {}".format(R_get_formula()))
print("Tau: {}".format(tau))
ScatterPlot(X=X, y=y, title="QuantileRegression - R",line_coef=coef_R)

################# QuantileRegression sklearn #################
start = time.time()
model_sklearn = QuantileRegressor(quantile=tau, alpha=0.0, solver='highs')
model_sklearn.fit(X, y['y'])  # pass y as a 1-D Series to avoid the column-vector warning
print('Sklearn took {} seconds to finish'.format(time.time()-start))
print("The formula is: {}".format(get_formula_from_sklearn(model_sklearn)))
print("Tau: {}".format(tau))
ScatterPlot(X=X, y=y, title="QuantileRegression - sklearn",line_coef=[model_sklearn.intercept_] + list(model_sklearn.coef_))

R_model
Sklearn_model

Why does it take so much longer to fit the model in sklearn than in the R implementation?

  • Please show the code needed to produce both results, as well as the (example) data. – 9769953 Nov 24 '21 at 12:12
  • 1
    Perhaps statsmodels can be of use here, instead of scikit-learn; as an extra comparison. – 9769953 Nov 24 '21 at 12:15
  • I tried statsmodels before the sklearn model, but I didn't get the same formulas (maybe because the features are not i.i.d.). Scikit-learn has released a new version that includes a quantile regression implementation, so why not try it? – Sapir Tubul Nov 24 '21 at 12:45
  • 2
    Your R code is Python? What are you comparing? What is `QuantileRegressionR`? Show relevant imports. – 9769953 Nov 24 '21 at 15:09
  • 2
    There is no use without data for us to try and reproduce your results. Please provide (public) data that produces your problem. Create a [mcve]. – 9769953 Nov 24 '21 at 15:10
  • 1
    what is QuantileRegressionR ??? – StupidWolf Nov 24 '21 at 20:09
  • As @StupidWolf implies, please also include the relevant library imports in both languages. – desertnaut Nov 25 '21 at 07:48
  • I'm having the same problem – seeker_after_truth Apr 02 '22 at 15:55
  • 1
    I think sklearn knew about this algorithm being slow as per the docs: "Method used by scipy.optimize.linprog to solve the linear programming formulation. Note that the highs methods are recommended for usage with scipy>=1.6.0 because they are the fastest ones." – Mauricio Maroto Apr 27 '22 at 05:27

2 Answers

0

As Mauricio suggested in the comments, changing the solver to HiGHS (solver="highs") works in some cases; at least, it solved the problem for me. Note that this may require installing the solver (the docs quoted in that comment recommend the highs methods with scipy>=1.6.0).

See here for how the parameter is used.

If your data set is somewhat larger, note that there is a reported issue in the scikit-learn GitHub repo.

berkorbay
0

I have implemented a fast quantile regression in Python that uses an interior point method. It also supports cluster-robust standard errors. Please check it out here: https://github.com/mozjay0619/pyqreg

There you will find installation instructions and some examples. If you do end up using it, please give the link some credit (give the repo a star; it took me a long time to make this). Cheers.

StatsNoob